Tag Archives: pandas

Efficient way to apply multiple filters to pandas DataFrame or Series

Question: Efficient way to apply multiple filters to pandas DataFrame or Series

I have a scenario where a user wants to apply several filters to a Pandas DataFrame or Series object. Essentially, I want to efficiently chain a bunch of filtering (comparison operations) together that are specified at run-time by the user.

The filters should be additive (aka each one applied should narrow results).

I’m currently using reindex() but this creates a new object each time and copies the underlying data (if I understand the documentation correctly). So, this could be really inefficient when filtering a big Series or DataFrame.

I’m thinking that using apply(), map(), or something similar might be better. I’m pretty new to Pandas though so still trying to wrap my head around everything.

TL;DR

I want to take a dictionary of the following form and apply each operation to a given Series object and return a ‘filtered’ Series object.

relops = {'>=': [1], '<=': [1]}

Long Example

I’ll start with an example of what I have currently and just filtering a single Series object. Below is the function I’m currently using:

   def apply_relops(series, relops):
        """
        Pass dictionary of relational operators to perform on given series object
        """
        for op, vals in relops.iteritems():
            op_func = ops[op]
            for val in vals:
                filtered = op_func(series, val)
                series = series.reindex(series[filtered])
        return series

The user provides a dictionary with the operations they want to perform:

>>> df = pandas.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
>>> print df
>>> print df
   col1  col2
0     0    10
1     1    11
2     2    12

>>> from operator import le, ge
>>> ops ={'>=': ge, '<=': le}
>>> apply_relops(df['col1'], {'>=': [1]})
col1
1       1
2       2
Name: col1
>>> apply_relops(df['col1'], relops = {'>=': [1], '<=': [1]})
col1
1       1
Name: col1

Again, the ‘problem’ with my above approach is that I think there is a lot of possibly unnecessary copying of the data for the in-between steps.

Also, I would like to expand this so that the dictionary passed in can include the columns to operate on and filter an entire DataFrame based on the input dictionary. However, I'm assuming whatever works for the Series can be easily expanded to a DataFrame.
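For reference, here is a minimal sketch (my own, not from the original post) of what that DataFrame extension could look like: a dictionary mapping each column to its own relops-style dictionary is folded into a single boolean mask, so the data is only subset once at the end. The function name and dictionary layout are illustrative assumptions, not an established API.

import operator
import pandas as pd

# Map operator strings to vectorized comparison functions.
ops = {'>=': operator.ge, '<=': operator.le, '>': operator.gt, '<': operator.lt, '==': operator.eq}

def apply_relops_frame(df, col_relops):
    # Start with an all-True mask and narrow it with each comparison.
    mask = pd.Series(True, index=df.index)
    for col, relops in col_relops.items():
        for op, vals in relops.items():
            for val in vals:
                mask &= ops[op](df[col], val)
    return df[mask]

df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
print(apply_relops_frame(df, {'col1': {'>=': [1]}, 'col2': {'<=': [11]}}))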


Answer 0

Pandas (and numpy) allow for boolean indexing, which will be much more efficient:

In [11]: df.loc[df['col1'] >= 1, 'col1']
Out[11]: 
1    1
2    2
Name: col1

In [12]: df[df['col1'] >= 1]
Out[12]: 
   col1  col2
1     1    11
2     2    12

In [13]: df[(df['col1'] >= 1) & (df['col1'] <=1 )]
Out[13]: 
   col1  col2
1     1    11

If you want to write helper functions for this, consider something along these lines:

In [14]: def b(x, col, op, n): 
             return op(x[col],n)

In [15]: def f(x, *b):
             return x[(np.logical_and(*b))]

In [16]: b1 = b(df, 'col1', ge, 1)

In [17]: b2 = b(df, 'col1', le, 1)

In [18]: f(df, b1, b2)
Out[18]: 
   col1  col2
1     1    11

Update: pandas 0.13 has a query method for these kinds of use cases; assuming column names are valid identifiers, the following works (and can be more efficient for large frames, as it uses numexpr behind the scenes):

In [21]: df.query('col1 <= 1 & 1 <= col1')
Out[21]:
   col1  col2
1     1    11
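Because the question asks for filters specified at run time, one hedged possibility (my own sketch, not part of the original answer) is to build the query string from a relops-style dictionary; this assumes the column names are valid identifiers and the operator strings come from trusted input.

# Hypothetical extension of the question's relops dict that names the column per filter.
relops = {'col1': {'>=': [1], '<=': [1]}}

expr = ' & '.join(
    '{} {} {}'.format(col, op, val)
    for col, col_ops in relops.items()
    for op, vals in col_ops.items()
    for val in vals
)
print(expr)            # col1 >= 1 & col1 <= 1
print(df.query(expr))  # same single-row result as Out[21] above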

Answer 1

Chaining conditions creates long lines, which are discouraged by PEP 8. Using the .query method forces you to use strings, which is powerful but unpythonic and not very dynamic.

Once each of the filters is in place, one approach is

import numpy as np
import functools
def conjunction(*conditions):
    return functools.reduce(np.logical_and, conditions)

c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4

data_filtered = data[conjunction(c_1, c_2, c_3)]

np.logical_and is vectorized and fast, but it does not take more than two arguments; that limitation is handled by functools.reduce.

Note that this still has some redundancies: a) short-circuiting does not happen on a global level, and b) each of the individual conditions runs on the whole initial data. Still, I expect this to be efficient enough for many applications, and it is very readable.

You can also make a disjunction (wherein only one of the conditions needs to be true) by using np.logical_or instead:

import numpy as np
import functools
def disjunction(*conditions):
    return functools.reduce(np.logical_or, conditions)

c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4

data_filtered = data[disjunction(c_1, c_2, c_3)]

Answer 2

Simplest of All Solutions:

Use:

filtered_df = df[(df['col1'] >= 1) & (df['col1'] <= 5)]

Another example: to filter the dataframe for values belonging to Feb-2018, use the code below:

filtered_df = df[(df['year'] == 2018) & (df['month'] == 2)]
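As a hedged aside (not part of the original answer), if the frame stores an actual datetime column rather than separate year and month columns, the .dt accessor produces the same kind of boolean mask; the 'date' column below is a hypothetical example.

import pandas as pd

# Hypothetical frame with a real datetime column.
df = pd.DataFrame({'date': pd.to_datetime(['2018-01-31', '2018-02-15', '2018-03-01']),
                   'sales': [100, 200, 300]})

# Same Feb-2018 filter expressed through the .dt accessor.
filtered_df = df[(df['date'].dt.year == 2018) & (df['date'].dt.month == 2)]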

Answer 3

Since the pandas 0.22 update, comparison methods are available, such as:

  • gt (greater than)
  • lt (less than)
  • eq (equal to)
  • ne (not equal to)
  • ge (greater than or equal to)

and many more. These functions return a boolean array. Let's see how we can use them:

# sample data
df = pd.DataFrame({'col1': [0, 1, 2,3,4,5], 'col2': [10, 11, 12,13,14,15]})

# get values from col1 greater than or equals to 1
df.loc[df['col1'].ge(1),'col1']

1    1
2    2
3    3
4    4
5    5

# where col1 values are between 0 and 2
df.loc[df['col1'].between(0,2)]

 col1 col2
0   0   10
1   1   11
2   2   12

# where col1 > 1
df.loc[df['col1'].gt(1)]

 col1 col2
2   2   12
3   3   13
4   4   14
5   5   15

Answer 4

Why not do this?

def filt_spec(df, col, val, op):
    import operator
    ops = {'eq': operator.eq, 'neq': operator.ne, 'gt': operator.gt, 'ge': operator.ge, 'lt': operator.lt, 'le': operator.le}
    return df[ops[op](df[col], val)]
pandas.DataFrame.filt_spec = filt_spec

Demo:

df = pd.DataFrame({'a': [1,2,3,4,5], 'b':[5,4,3,2,1]})
df.filt_spec('a', 2, 'ge')

Result:

   a  b
 1  2  4
 2  3  3
 3  4  2
 4  5  1

You can see that column ‘a’ has been filtered where a >=2.

This is slightly faster (typing time, not performance) than operator chaining. You could of course put the import at the top of the file.


Answer 5

We can also select rows based on values of a column that are not in a list or other iterable. We build the boolean mask just like before, but negate it by placing ~ in front.

For example

vals = [1, 0]
df[~df.col1.isin(vals)]

python pandas: apply a function with arguments to a series

Question: python pandas: apply a function with arguments to a series

I want to apply a function with arguments to a series in python pandas:

x = my_series.apply(my_function, more_arguments_1)
y = my_series.apply(my_function, more_arguments_2)
...

The documentation describes support for an apply method, but it doesn’t accept any arguments. Is there a different method that accepts arguments? Alternatively, am I missing a simple workaround?

Update (October 2017): Note that since this question was originally asked, pandas apply() has been updated to handle positional and keyword arguments, and the documentation link above now reflects that and shows how to include either type of argument.


Answer 0

Newer versions of pandas do allow you to pass extra arguments (see the new documentation). So now you can do:

my_series.apply(your_function, args=(2,3,4), extra_kw=1)

The positional arguments are added after the element of the series.
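A small illustration of that ordering (the clip_value function and its arguments are my own hypothetical example, not from the answer): each series element becomes the first positional argument, and the args tuple fills the remaining ones.

import pandas as pd

def clip_value(x, lower, upper):
    # x is the series element; lower and upper come from args.
    return min(max(x, lower), upper)

my_series = pd.Series([-5, 3, 42])
print(my_series.apply(clip_value, args=(0, 10)))  # -> 0, 3, 10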


For older versions of pandas:

The documentation explains this clearly. The apply method accepts a Python function which should have a single parameter. If you want to pass more parameters, you should use functools.partial as suggested by Joel Cornett in his comment.

An example:

>>> import functools
>>> import operator
>>> add_3 = functools.partial(operator.add,3)
>>> add_3(2)
5
>>> add_3(7)
10

You can also pass keyword arguments using partial.
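For instance, a minimal sketch of the keyword form (scale_and_shift is a hypothetical example function, not something from the answer):

import functools
import pandas as pd

def scale_and_shift(x, scale=1, shift=0):
    return x * scale + shift

# Fix the keyword argument shift=2 up front.
shift_by_2 = functools.partial(scale_and_shift, shift=2)
print(shift_by_2(10))  # 12

my_series = pd.Series([1, 2, 3])
print(my_series.apply(shift_by_2))  # 3, 4, 5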

Another way would be to create a lambda:

my_series.apply((lambda x: your_func(a,b,c,d,...,x)))

But I think using partial is better.


Answer 1

Steps:

  1. Create a dataframe
  2. Create a function
  3. Use the named arguments of the function in the apply statement.

Example

x=pd.DataFrame([1,2,3,4])  

def add(i1, i2):  
    return i1+i2

x.apply(add,i2=9)

The outcome of this example is that each number in the dataframe will be added to the number 9.

    0
0  10
1  11
2  12
3  13

Explanation:

The “add” function has two parameters: i1 and i2. The first parameter will be the value in the data frame, and the second is whatever we pass to the “apply” function. In this case, we are passing “9” to the apply function using the keyword argument “i2”.


Answer 2

Series.apply(func, convert_dtype=True, args=(), **kwds)

args : tuple

x = my_series.apply(my_function, args = (arg1,))

Answer 3

You can pass any number of arguments to the function that apply is calling, either as unnamed arguments passed as a tuple to the args parameter, or as other keyword arguments, which are captured internally as a dictionary by the kwds parameter.

For instance, let’s build a function that returns True for values between 3 and 6, and False otherwise.

s = pd.Series(np.random.randint(0,10, 10))
s

0    5
1    3
2    1
3    1
4    6
5    0
6    3
7    4
8    9
9    6
dtype: int64

s.apply(lambda x: x >= 3 and x <= 6)

0     True
1     True
2    False
3    False
4     True
5    False
6     True
7     True
8    False
9     True
dtype: bool

This anonymous function isn’t very flexible. Let’s create a normal function with two arguments to control the min and max values we want in our Series.

def between(x, low, high):
    return x >= low and x <= high

We can replicate the output of the first function by passing unnamed arguments to args:

s.apply(between, args=(3,6))

Or we can use the named arguments

s.apply(between, low=3, high=6)

Or even a combination of both

s.apply(between, args=(3,), high=6)

Pandas percentage of total with groupby

Question: Pandas percentage of total with groupby

This is obviously simple, but as a numpy newbie I'm getting stuck.

I have a CSV file that contains 3 columns, the State, the Office ID, and the Sales for that office.

I want to calculate the percentage of sales per office in a given state (total of all percentages in each state is 100%).

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': range(1, 7) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})

df.groupby(['state', 'office_id']).agg({'sales': 'sum'})

This returns:

                  sales
state office_id        
AZ    2          839507
      4          373917
      6          347225
CA    1          798585
      3          890850
      5          454423
CO    1          819975
      3          202969
      5          614011
WA    2          163942
      4          369858
      6          959285

I can’t seem to figure out how to “reach up” to the state level of the groupby to total up the sales for the entire state to calculate the fraction.


Answer 0

Paul H’s answer is right that you will have to make a second groupby object, but you can calculate the percentage in a simpler way — just groupby the state_office and divide the sales column by its sum. Copying the beginning of Paul H’s answer:

# From Paul H
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: groupby state_office and divide by sum
state_pcts = state_office.groupby(level=0).apply(lambda x:
                                                 100 * x / float(x.sum()))

Returns:

                     sales
state office_id           
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508

Answer 1

You need to make a second groupby object that groups by the states, and then use the div method:

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100


                     sales
state office_id           
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508

The level='state' kwarg in div tells pandas to broadcast/join the dataframes based on the values in the state level of the index.


Answer 2

For conciseness I’d use the SeriesGroupBy:

In [11]: c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")

In [12]: c
Out[12]:
state  office_id
AZ     2            925105
       4            592852
       6            362198
CA     1            819164
       3            743055
       5            292885
CO     1            525994
       3            338378
       5            490335
WA     2            623380
       4            441560
       6            451428
Name: count, dtype: int64

In [13]: c / c.groupby(level=0).sum()
Out[13]:
state  office_id
AZ     2            0.492037
       4            0.315321
       6            0.192643
CA     1            0.441573
       3            0.400546
       5            0.157881
CO     1            0.388271
       3            0.249779
       5            0.361949
WA     2            0.411101
       4            0.291196
       6            0.297703
Name: count, dtype: float64

For multiple groups you have to use transform (using Radical’s df):

In [21]: c =  df.groupby(["Group 1","Group 2","Final Group"])["Numbers I want as percents"].sum().rename("count")

In [22]: c / c.groupby(level=[0, 1]).transform("sum")
Out[22]:
Group 1  Group 2  Final Group
AAHQ     BOSC     OWON           0.331006
                  TLAM           0.668994
         MQVF     BWSI           0.288961
                  FXZM           0.711039
         ODWV     NFCH           0.262395
...
Name: count, dtype: float64

This seems to be slightly more performant than the other answers (just less than twice the speed of Radical’s answer, for me ~0.08s).


Answer 3

I think this needs benchmarking. Using OP’s original DataFrame,

df = pd.DataFrame({
    'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
    'office_id': range(1, 7) * 2,
    'sales': [np.random.randint(100000, 999999) for _ in range(12)]
})

1st Andy Hayden

As commented on his answer, Andy takes full advantage of vectorisation and pandas indexing.

c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
c / c.groupby(level=0).sum()

3.42 ms ± 16.7 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)


2nd Paul H

state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100

4.66 ms ± 24.4 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)


3rd exp1orer

This is the slowest answer as it calculates x.sum() for each x in level 0.

For me, this is still a useful answer, though not in its current form. For quick EDA on smaller datasets, apply allows you to use method chaining to write this in a single line. We therefore remove the need to decide on a variable name, which is actually very computationally expensive for your most valuable resource (your brain!).

Here is the modification,

(
    df.groupby(['state', 'office_id'])
    .agg({'sales': 'sum'})
    .groupby(level=0)
    .apply(lambda x: 100 * x / float(x.sum()))
)

10.6 ms ± 81.5 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)


So no one is going to care about 6 ms on a small dataset. However, this is a 3x speed-up, and on a larger dataset with high-cardinality groupbys it is going to make a massive difference.

Adding to the above code, we make a DataFrame with shape (12,000,000, 3) with 14412 state categories and 600 office_ids,

import string

import numpy as np
import pandas as pd
np.random.seed(0)

groups = [
    ''.join(i) for i in zip(
    np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
    np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
    np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
                       )
]

df = pd.DataFrame({'state': groups * 400,
               'office_id': list(range(1, 601)) * 20000,
               'sales': [np.random.randint(100000, 999999)
                         for _ in range(12)] * 1000000
})

Using Andy’s,

2 s ± 10.4 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)

and exp1orer

19 s ± 77.1 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)

So now we see a 10x speed-up on large, high-cardinality datasets.
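For anyone who wants to reproduce the comparison outside IPython's %timeit, here is a rough sketch (my own, with the two approaches wrapped as functions taken from the answers above); the exact numbers will of course vary by machine.

import timeit

def andy(df):
    c = df.groupby(['state', 'office_id'])['sales'].sum()
    return c / c.groupby(level=0).sum()

def exp1orer(df):
    state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
    return state_office.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))

# df here is the large (12,000,000, 3) frame built above.
print(timeit.timeit(lambda: andy(df), number=1))
print(timeit.timeit(lambda: exp1orer(df), number=1))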


Be sure to upvote these three answers if you upvote this one!


Answer 4

(This solution is inspired by this article: https://pbpython.com/pandas_transform.html)

I find the following solution to be the simplest (and probably the fastest), using transformation:

Transformation: While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input.

So, using transformation, the solution is a one-liner:

df['%'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')

And if you print:

print(df.sort_values(['state', 'office_id']).reset_index(drop=True))

   state  office_id   sales          %
0     AZ          2  195197   9.844309
1     AZ          4  877890  44.274352
2     AZ          6  909754  45.881339
3     CA          1  614752  50.415708
4     CA          3  395340  32.421767
5     CA          5  209274  17.162525
6     CO          1  549430  42.659629
7     CO          3  457514  35.522956
8     CO          5  280995  21.817415
9     WA          2  828238  35.696929
10    WA          4  719366  31.004563
11    WA          6  772590  33.298509

Answer 5

I know that this is an old question, but exp1orer's answer is very slow for datasets with a large number of unique groups (probably because of the lambda). I built on their answer to turn it into an array calculation, so now it's super fast! Below is the example code:

Create the test dataframe with 50,000 unique groups

import random
import string
import pandas as pd
import numpy as np
np.random.seed(0)

# This is the total number of groups to be created
NumberOfGroups = 50000

# Create a lot of groups (random strings of 4 letters)
Group1     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups//10)]*10
Group2     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups//2)]*2
FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]

# Make the numbers
NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]

# Make the dataframe
df = pd.DataFrame({'Group 1': Group1,
                   'Group 2': Group2,
                   'Final Group': FinalGroup,
                   'Numbers I want as percents': NumbersForPercents})

When grouped it looks like:

                             Numbers I want as percents
Group 1 Group 2 Final Group                            
AAAH    AQYR    RMCH                                847
                XDCL                                182
        DQGO    ALVF                                132
                AVPH                                894
        OVGH    NVOO                                650
                VKQP                                857
        VNLY    HYFW                                884
                MOYH                                469
        XOOC    GIDS                                168
                HTOY                                544
AACE    HNXU    RAXK                                243
                YZNK                                750
        NOYI    NYGC                                399
                ZYCI                                614
        QKGK    CRLF                                520
                UXNA                                970
        TXAR    MLNB                                356
                NMFJ                                904
        VQYG    NPON                                504
                QPKQ                                948
...
[50000 rows x 1 columns]

Array method of finding percentage:

# Initial grouping (basically a sorted version of df)
PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
# Get the sum of values for the "final group", append "_Sum" to its column name, and change it into a dataframe (.reset_index)
SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
# Merge the two dataframes
Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
# Divide the two columns
Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
# Drop the extra _Sum column
Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)

This method takes about 0.15 seconds

Top answer method (using lambda function):

state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'})
state_pcts = state_office.groupby(level=['Group 1','Group 2']).apply(lambda x: 100 * x / float(x.sum()))

This method takes about 21 seconds to produce the same result.

The result:

      Group 1 Group 2 Final Group  Numbers I want as percents  Percent of Final Group
0        AAAH    AQYR        RMCH                         847               82.312925
1        AAAH    AQYR        XDCL                         182               17.687075
2        AAAH    DQGO        ALVF                         132               12.865497
3        AAAH    DQGO        AVPH                         894               87.134503
4        AAAH    OVGH        NVOO                         650               43.132050
5        AAAH    OVGH        VKQP                         857               56.867950
6        AAAH    VNLY        HYFW                         884               65.336290
7        AAAH    VNLY        MOYH                         469               34.663710
8        AAAH    XOOC        GIDS                         168               23.595506
9        AAAH    XOOC        HTOY                         544               76.404494

Answer 6

I realize there are already good answers here.

I nevertheless would like to contribute my own, because I feel for an elementary, simple question like this, there should be a short solution that is understandable at a glance.

It should also work in a way that I can add the percentages as a new column, leaving the rest of the dataframe untouched. Last but not least, it should generalize in an obvious way to the case in which there is more than one grouping level (e.g., state and country instead of only state).

The following snippet fulfills these criteria:

df['sales_ratio'] = df.groupby(['state'])['sales'].transform(lambda x: x/x.sum())

Note that if you’re still using Python 2, you’ll have to replace the x in the denominator of the lambda term by float(x).
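As a sketch of the multi-level case mentioned above (assuming a hypothetical 'country' column alongside 'state', which is not in the sample data), the same transform simply takes a longer key list:

# Ratio within each (country, state) group; 'country' is an assumed extra column.
df['sales_ratio'] = df.groupby(['country', 'state'])['sales'].transform(lambda x: x / x.sum())

# Equivalent, and usually faster because it avoids the Python-level lambda:
df['sales_ratio'] = df['sales'] / df.groupby(['country', 'state'])['sales'].transform('sum')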


Answer 7

The most elegant way to find percentages across columns or index is to use pd.crosstab.

Sample Data

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

The output dataframe looks like this:

print(df)

        state   office_id   sales
    0   CA  1   764505
    1   WA  2   313980
    2   CO  3   558645
    3   AZ  4   883433
    4   CA  5   301244
    5   WA  6   752009
    6   CO  1   457208
    7   AZ  2   259657
    8   CA  3   584471
    9   WA  4   122358
    10  CO  5   721845
    11  AZ  6   136928

Just specify the index, columns and the values to aggregate. The normalize keyword will calculate % across index or columns depending upon the context.

result = pd.crosstab(index=df['state'], 
                     columns=df['office_id'], 
                     values=df['sales'], 
                     aggfunc='sum', 
                     normalize='index').applymap('{:.2f}%'.format)




print(result)
office_id   1   2   3   4   5   6
state                       
AZ  0.00%   0.20%   0.00%   0.69%   0.00%   0.11%
CA  0.46%   0.00%   0.35%   0.00%   0.18%   0.00%
CO  0.26%   0.00%   0.32%   0.00%   0.42%   0.00%
WA  0.00%   0.26%   0.00%   0.10%   0.00%   0.63%

Answer 8

You can sum the whole DataFrame and divide by the state total:

# Copying setup from Paul H answer
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
# Add a column with the sales divided by state total sales.
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

df

Returns

    office_id   sales state  sales_ratio
0           1  405711    CA     0.193319
1           2  535829    WA     0.347072
2           3  217952    CO     0.198743
3           4  252315    AZ     0.192500
4           5  982371    CA     0.468094
5           6  459783    WA     0.297815
6           1  404137    CO     0.368519
7           2  222579    AZ     0.169814
8           3  710581    CA     0.338587
9           4  548242    WA     0.355113
10          5  474564    CO     0.432739
11          6  835831    AZ     0.637686

But note that this only works because all columns other than state are numeric, enabling summation of the entire DataFrame. For example, if office_id is character instead, you get an error:

df.office_id = df.office_id.astype(str)
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

TypeError: unsupported operand type(s) for /: ‘str’ and ‘str’
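A possible workaround (my own sketch, not part of the original answer) is to restrict both the numerator and the group-wise sum to the numeric sales column, so non-numeric columns such as a string office_id never enter the division:

# Works regardless of the dtypes of the other columns.
df['sales_ratio'] = df['sales'] / df.groupby('state')['sales'].transform('sum')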


Answer 9

I think this would do the trick in 1 line:

df.groupby(['state', 'office_id']).sum().transform(lambda x: x/np.sum(x)*100)

Answer 10

A simple way I have used is a merge after the two groupbys, then doing a simple division.

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

state_office = df.groupby(['state', 'office_id'])['sales'].sum().reset_index()
state = df.groupby(['state'])['sales'].sum().reset_index()
state_office = state_office.merge(state, left_on='state', right_on ='state', how = 'left')
state_office['sales_ratio'] = 100*(state_office['sales_x']/state_office['sales_y'])

   state  office_id  sales_x  sales_y  sales_ratio
0     AZ          2   222579  1310725    16.981365
1     AZ          4   252315  1310725    19.250033
2     AZ          6   835831  1310725    63.768601
3     CA          1   405711  2098663    19.331879
4     CA          3   710581  2098663    33.858747
5     CA          5   982371  2098663    46.809373
6     CO          1   404137  1096653    36.851857
7     CO          3   217952  1096653    19.874290
8     CO          5   474564  1096653    43.273852
9     WA          2   535829  1543854    34.707233
10    WA          4   548242  1543854    35.511259
11    WA          6   459783  1543854    29.781508

Answer 11

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999)
                         for _ in range(12)]})

grouped = df.groupby(['state', 'office_id'])
100*grouped.sum()/df[["state","sales"]].groupby('state').sum()

Returns:

sales
state   office_id   
AZ  2   54.587910
    4   33.009225
    6   12.402865
CA  1   32.046582
    3   44.937684
    5   23.015735
CO  1   21.099989
    3   31.848658
    5   47.051353
WA  2   43.882790
    4   10.265275
    6   45.851935

Answer 12

As someone who is also learning pandas, I found the other answers a bit implicit, since pandas hides most of the work behind the scenes, namely in how the operation automatically matches up column and index names. This code should be equivalent to a step-by-step version of @exp1orer's accepted answer.

I'll refer to the df by the alias state_office_sales:

                  sales
state office_id        
AZ    2          839507
      4          373917
      6          347225
CA    1          798585
      3          890850
      5          454423
CO    1          819975
      3          202969
      5          614011
WA    2          163942
      4          369858
      6          959285

state_total_sales is state_office_sales summed over index level 0 (the leftmost level).

In:   state_total_sales = df.groupby(level=0).sum()
      state_total_sales

Out: 
       sales
state   
AZ     2448009
CA     2832270
CO     1495486
WA     595859

Because the two dataframes share an index name and a column name, pandas will align them on the shared index and find the appropriate locations, like so:

In:   state_office_sales / state_total_sales

Out:  

                   sales
state   office_id   
AZ      2          0.448640
        4          0.125865
        6          0.425496
CA      1          0.288022
        3          0.322169
        5          0.389809
CO      1          0.206684
        3          0.357891
        5          0.435425
WA      2          0.321689
        4          0.346325
        6          0.331986

To illustrate this even better, here is a partial total with an XX entry that has no equivalent. Pandas will match locations based on index and column names; where there is no overlap, the result is NaN:

In:   partial_total = pd.DataFrame(
                      data   =  {'sales' : [2448009, 595859, 99999]},
                      index  =             ['AZ',    'WA',   'XX' ]
                      )
      partial_total.index.name = 'state'


Out:  
         sales
state
AZ       2448009
WA       595859
XX       99999
In:   state_office_sales / partial_total

Out: 
                   sales
state   office_id   
AZ      2          0.448640
        4          0.125865
        6          0.425496
CA      1          NaN
        3          NaN
        5          NaN
CO      1          NaN
        3          NaN
        5          NaN
WA      2          0.321689
        4          0.346325
        6          0.331986

This becomes very clear when there are no shared indexes or columns. Here missing_index_totals is equal to state_total_sales, except that it has no index name.

In:   missing_index_totals = state_total_sales.rename_axis("")
      missing_index_totals

Out:  
       sales
AZ     2448009
CA     2832270
CO     1495486
WA     595859
In:   state_office_sales / missing_index_totals 

Out:  ValueError: cannot join with no overlapping index names

Answer 13

One-line solution:

df.join(
    df.groupby('state').agg(state_total=('sales', 'sum')),
    on='state'
).eval('sales / state_total')

This returns a Series of per-office ratios; it can be used on its own or assigned to the original DataFrame.
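For example, a short usage sketch assigning the result back to the frame (the column name sales_ratio is arbitrary):

df['sales_ratio'] = df.join(
    df.groupby('state').agg(state_total=('sales', 'sum')),
    on='state'
).eval('sales / state_total')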


How to loop over grouped Pandas dataframe?

Question: How to loop over grouped Pandas dataframe?

DataFrame:

  c_os_family_ss c_os_major_is l_customer_id_i
0      Windows 7                         90418
1      Windows 7                         90418
2      Windows 7                         90418

Code:

print df
for name, group in df.groupby('l_customer_id_i').agg(lambda x: ','.join(x)):
    print name
    print group

I’m trying to just loop over the aggregated data, but I get the error:

ValueError: too many values to unpack

@EdChum, here’s the expected output:

                                                    c_os_family_ss  \
l_customer_id_i
131572           Windows 7,Windows 7,Windows 7,Windows 7,Window...
135467           Windows 7,Windows 7,Windows 7,Windows 7,Window...

                                                     c_os_major_is
l_customer_id_i
131572           ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...
135467           ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...

The output is not the problem, I wish to loop over every group.


回答 0

df.groupby('l_customer_id_i').agg(lambda x: ','.join(x)) 确实已经返回了数据帧,因此您无法再遍历这些组。

一般来说:

  • df.groupby(...)返回一个 GroupBy 对象（DataFrameGroupBy 或 SeriesGroupBy），借助它您可以遍历各个分组（正如文档中所解释的）。您可以执行以下操作：

    grouped = df.groupby('A')
    
    for name, group in grouped:
        ...
  • 当您在 groupby 上应用函数时（在你的例子中是 df.groupby(...).agg(...)，但也可以是 transform、apply、mean 等），各个分组上应用该函数的结果会被合并到一个数据框中（即 groupby 的 “split-apply-combine” 范式中的 apply 与 combine 步骤）。因此，结果将始终是一个 DataFrame（或 Series，取决于所应用的函数）。

df.groupby('l_customer_id_i').agg(lambda x: ','.join(x)) does already return a dataframe, so you cannot loop over the groups anymore.

In general:

  • df.groupby(...) returns a GroupBy object (a DataFrameGroupBy or SeriesGroupBy), and with this, you can iterate through the groups (as explained in the docs here). You can do something like:

    grouped = df.groupby('A')
    
    for name, group in grouped:
        ...
    
  • When you apply a function on the groupby, as in your example df.groupby(...).agg(...) (but this can also be transform, apply, mean, …), the results of applying that function to the different groups are combined into one dataframe (the apply and combine steps of groupby’s ‘split-apply-combine’ paradigm). So the result will always be a DataFrame again (or a Series, depending on the applied function); see the sketch below.
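
For example, to loop per customer you can either aggregate inside the loop over the raw groups, or aggregate first and then iterate over the rows of the result; a minimal sketch using the column names from the question:

grouped = df.groupby('l_customer_id_i')

# Option 1: iterate over the raw groups and aggregate inside the loop
for name, group in grouped:
    print(name)
    print(group['c_os_family_ss'].str.cat(sep=','))

# Option 2: aggregate first, then iterate over the rows of the result
agg = grouped.agg(lambda x: ','.join(x))
for name, row in agg.iterrows():
    print(name)
    print(row)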


回答 1

下面是一个按列 atable 分组后遍历 pd.DataFrame 的示例。作为示例用例，在 for 循环中为 SQL 数据库生成 CREATE 语句：

import pandas as pd

df1 = pd.DataFrame({
    'atable':     ['Users', 'Users', 'Domains', 'Domains', 'Locks'],
    'column':     ['col_1', 'col_2', 'col_a', 'col_b', 'col'],
    'column_type':['varchar', 'varchar', 'int', 'varchar', 'varchar'],
    'is_null':    ['No', 'No', 'Yes', 'No', 'Yes'],
})

df1_grouped = df1.groupby('atable')

# iterate over each group
for group_name, df_group in df1_grouped:
    print('\nCREATE TABLE {}('.format(group_name))

    for row_index, row in df_group.iterrows():
        col = row['column']
        column_type = row['column_type']
        is_null = 'NOT NULL' if row['is_null'] == 'No' else ''
        print('\t{} {} {},'.format(col, column_type, is_null))

    print(");")

Here is an example of iterating over a pd.DataFrame grouped by the column atable. As a sample use case, CREATE statements for an SQL database are generated within the for loop:

import pandas as pd

df1 = pd.DataFrame({
    'atable':     ['Users', 'Users', 'Domains', 'Domains', 'Locks'],
    'column':     ['col_1', 'col_2', 'col_a', 'col_b', 'col'],
    'column_type':['varchar', 'varchar', 'int', 'varchar', 'varchar'],
    'is_null':    ['No', 'No', 'Yes', 'No', 'Yes'],
})

df1_grouped = df1.groupby('atable')

# iterate over each group
for group_name, df_group in df1_grouped:
    print('\nCREATE TABLE {}('.format(group_name))

    for row_index, row in df_group.iterrows():
        col = row['column']
        column_type = row['column_type']
        is_null = 'NOT NULL' if row['is_null'] == 'No' else ''
        print('\t{} {} {},'.format(col, column_type, is_null))

    print(");")

回答 2

如果已经创建了数据框,则可以遍历索引值。

df = df.groupby('l_customer_id_i').agg(lambda x: ','.join(x))
for name in df.index:
    print name
    print df.loc[name]

You can iterate over the index values if your dataframe has already been created.

df = df.groupby('l_customer_id_i').agg(lambda x: ','.join(x))
for name in df.index:
    print name
    print df.loc[name]

将“熊猫”列中的字典/列表拆分为单独的列

问题:将“熊猫”列中的字典/列表拆分为单独的列

我将数据保存在postgreSQL数据库中。我正在使用Python2.7查询此数据并将其转换为Pandas DataFrame。但是,此数据框的最后一列中包含值的字典(或列表?)。DataFrame看起来像这样:

[1] df
Station ID     Pollutants
8809           {"a": "46", "b": "3", "c": "12"}
8810           {"a": "36", "b": "5", "c": "8"}
8811           {"b": "2", "c": "7"}
8812           {"c": "11"}
8813           {"a": "82", "c": "15"}

我需要将此列拆分为单独的列,以便DataFrame如下所示:

[2] df2
Station ID     a      b       c
8809           46     3       12
8810           36     5       8
8811           NaN    2       7
8812           NaN    NaN     11
8813           82     NaN     15

我遇到的主要问题是列表的长度不同。但是所有列表最多只能包含相同的3个值:a,b和c。而且它们始终以相同的顺序出现(第一,第二,第三)。

以下代码以前是可以工作的，并且返回我想要的内容（df2）。

[3] df 
[4] objs = [df, pandas.DataFrame(df['Pollutant Levels'].tolist()).iloc[:, :3]]
[5] df2 = pandas.concat(objs, axis=1).drop('Pollutant Levels', axis=1)
[6] print(df2)

我上周才运行此代码,并且运行良好。但是现在我的代码坏了,我从第[4]行得到了这个错误:

IndexError: out-of-bounds on slice (end) 

我没有对代码进行任何更改,但是现在出现了错误。我觉得这是由于我的方法不够健壮或不合适。

对于如何将列表的此列拆分为单独的列的任何建议或指导,将不胜感激!

编辑:我认为.tolist()和.apply方法不适用于我的代码,因为它是一个unicode字符串,即:

#My data format 
u{'a': '1', 'b': '2', 'c': '3'}

#and not
{u'a': '1', u'b': '2', u'c': '3'}

数据是从PostgreSQL数据库以这种格式导入的。这个问题有什么帮助或想法吗?有没有办法转换unicode?

I have data saved in a postgreSQL database. I am querying this data using Python2.7 and turning it into a Pandas DataFrame. However, the last column of this dataframe has a dictionary (or list?) of values within it. The DataFrame looks like this:

[1] df
Station ID     Pollutants
8809           {"a": "46", "b": "3", "c": "12"}
8810           {"a": "36", "b": "5", "c": "8"}
8811           {"b": "2", "c": "7"}
8812           {"c": "11"}
8813           {"a": "82", "c": "15"}

I need to split this column into separate columns so that the DataFrame looks like this:

[2] df2
Station ID     a      b       c
8809           46     3       12
8810           36     5       8
8811           NaN    2       7
8812           NaN    NaN     11
8813           82     NaN     15

The major issue I’m having is that the lists are not the same lengths. But all of the lists only contain up to the same 3 values: a, b, and c. And they always appear in the same order (a first, b second, c third).

The following code USED to work and return exactly what I wanted (df2).

[3] df 
[4] objs = [df, pandas.DataFrame(df['Pollutant Levels'].tolist()).iloc[:, :3]]
[5] df2 = pandas.concat(objs, axis=1).drop('Pollutant Levels', axis=1)
[6] print(df2)

I was running this code just last week and it was working fine. But now my code is broken and I get this error from line [4]:

IndexError: out-of-bounds on slice (end) 

I made no changes to the code but am now getting the error. I feel this is due to my method not being robust or proper.

Any suggestions or guidance on how to split this column of lists into separate columns would be super appreciated!

EDIT: I think the .tolist() and .apply methods are not working on my code because it is one Unicode string, i.e.:

#My data format 
u{'a': '1', 'b': '2', 'c': '3'}

#and not
{u'a': '1', u'b': '2', u'c': '3'}

The data is imported from the PostgreSQL database in this format. Any help or ideas with this issue? Is there a way to convert the Unicode?


回答 0

要将字符串转换为实际的dict,可以执行df['Pollutant Levels'].map(eval)。之后,可以使用以下解决方案将dict转换为不同的列。


通过一个小例子,您可以使用.apply(pd.Series)

In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

In [3]: df
Out[3]:
   a                   b
0  1           {u'c': 1}
1  2           {u'd': 3}
2  3  {u'c': 5, u'd': 6}

In [4]: df['b'].apply(pd.Series)
Out[4]:
     c    d
0  1.0  NaN
1  NaN  3.0
2  5.0  6.0

要将其与数据框的其余部分合并,可以concat将其他列与上述结果结合在一起:

In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
Out[7]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0

用你的代码，如果省略 iloc 这一部分，同样可以工作：

In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
Out[15]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0

To convert the string to an actual dict, you can do df['Pollutant Levels'].map(eval). Afterwards, the solution below can be used to convert the dict to different columns.


Using a small example, you can use .apply(pd.Series):

In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

In [3]: df
Out[3]:
   a                   b
0  1           {u'c': 1}
1  2           {u'd': 3}
2  3  {u'c': 5, u'd': 6}

In [4]: df['b'].apply(pd.Series)
Out[4]:
     c    d
0  1.0  NaN
1  NaN  3.0
2  5.0  6.0

To combine it with the rest of the dataframe, you can concat the other columns with the above result:

In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
Out[7]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0

Using your code, this also works if I leave out the iloc part:

In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
Out[15]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0
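
If the column holds string representations of the dicts rather than actual dicts, ast.literal_eval is a safer alternative to eval for the parsing step; a minimal sketch with made-up data:

import ast
import pandas as pd

s = pd.Series(['{"a": "46", "b": "3"}', '{"c": "11"}'])
parsed = s.map(ast.literal_eval)           # parse literal dicts without running arbitrary code
expanded = pd.DataFrame(parsed.tolist())   # one column per key, NaN where a key is missing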

回答 1

我知道这个问题已经很老了,但是我到这里来寻找答案。实际上,现在有一种更好(更快)的方法json_normalize

import pandas as pd

df2 = pd.json_normalize(df['Pollutant Levels'])

这样就避免了代价高昂的 apply 调用……

I know the question is quite old, but I got here searching for answers. There is actually a better (and faster) way now of doing this using json_normalize:

import pandas as pd

df2 = pd.json_normalize(df['Pollutant Levels'])

This avoids costly apply functions…
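
Note that json_normalize expects dicts (or JSON-like records); if the column actually holds JSON strings, they would need to be parsed first. A minimal sketch with made-up data:

import json
import pandas as pd

df = pd.DataFrame({'Pollutants': ['{"a": "46", "b": "3"}', '{"c": "11"}']})
df2 = pd.json_normalize(df['Pollutants'].map(json.loads))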


回答 2

尝试以下操作：从 SQL 返回的数据必须先转换成 dict。另外注意，这里的列 "Pollutant Levels" 被命名为 Pollutants：

   StationID                   Pollutants
0       8809  {"a":"46","b":"3","c":"12"}
1       8810   {"a":"36","b":"5","c":"8"}
2       8811            {"b":"2","c":"7"}
3       8812                   {"c":"11"}
4       8813          {"a":"82","c":"15"}


df2["Pollutants"] = df2["Pollutants"].apply(lambda x : dict(eval(x)) )
df3 = df2["Pollutants"].apply(pd.Series )

    a    b   c
0   46    3  12
1   36    5   8
2  NaN    2   7
3  NaN  NaN  11
4   82  NaN  15


result = pd.concat([df, df3], axis=1).drop('Pollutants', axis=1)
result

   StationID    a    b   c
0       8809   46    3  12
1       8810   36    5   8
2       8811  NaN    2   7
3       8812  NaN  NaN  11
4       8813   82  NaN  15

Try this: the data returned from SQL has to be converted into a dict. Note also that the column "Pollutant Levels" is named Pollutants here:

   StationID                   Pollutants
0       8809  {"a":"46","b":"3","c":"12"}
1       8810   {"a":"36","b":"5","c":"8"}
2       8811            {"b":"2","c":"7"}
3       8812                   {"c":"11"}
4       8813          {"a":"82","c":"15"}


df2["Pollutants"] = df2["Pollutants"].apply(lambda x : dict(eval(x)) )
df3 = df2["Pollutants"].apply(pd.Series )

    a    b   c
0   46    3  12
1   36    5   8
2  NaN    2   7
3  NaN  NaN  11
4   82  NaN  15


result = pd.concat([df, df3], axis=1).drop('Pollutants', axis=1)
result

   StationID    a    b   c
0       8809   46    3  12
1       8810   36    5   8
2       8811  NaN    2   7
3       8812  NaN  NaN  11
4       8813   82  NaN  15

回答 3

Merlin 的答案更好也更简单，但其实不需要 lambda 函数：如果这一列的值本来就是 dict，可以安全地跳过 eval 这一步，按下面两种方式之一处理即可：

方法1:两步

# step 1: convert the `Pollutants` column to Pandas dataframe series
df_pol_ps = data_df['Pollutants'].apply(pd.Series)

df_pol_ps:
    a   b   c
0   46  3   12
1   36  5   8
2   NaN 2   7
3   NaN NaN 11
4   82  NaN 15

# step 2: concat columns `a, b, c` and drop/remove the `Pollutants` 
df_final = pd.concat([df, df_pol_ps], axis = 1).drop('Pollutants', axis = 1)

df_final:
    StationID   a   b   c
0   8809    46  3   12
1   8810    36  5   8
2   8811    NaN 2   7
3   8812    NaN NaN 11
4   8813    82  NaN 15

方式2:以上两个步骤可以一并组合:

df_final = pd.concat([df, df['Pollutants'].apply(pd.Series)], axis = 1).drop('Pollutants', axis = 1)

df_final:
    StationID   a   b   c
0   8809    46  3   12
1   8810    36  5   8
2   8811    NaN 2   7
3   8812    NaN NaN 11
4   8813    82  NaN 15

Merlin’s answer is better and super easy, but we don’t need a lambda function: if the values are already dicts, the eval step can be safely skipped, in either of the two ways illustrated below:

Way 1: Two steps

# step 1: convert the `Pollutants` column to Pandas dataframe series
df_pol_ps = data_df['Pollutants'].apply(pd.Series)

df_pol_ps:
    a   b   c
0   46  3   12
1   36  5   8
2   NaN 2   7
3   NaN NaN 11
4   82  NaN 15

# step 2: concat columns `a, b, c` and drop/remove the `Pollutants` 
df_final = pd.concat([df, df_pol_ps], axis = 1).drop('Pollutants', axis = 1)

df_final:
    StationID   a   b   c
0   8809    46  3   12
1   8810    36  5   8
2   8811    NaN 2   7
3   8812    NaN NaN 11
4   8813    82  NaN 15

Way 2: The above two steps can be combined in one go:

df_final = pd.concat([df, df['Pollutants'].apply(pd.Series)], axis = 1).drop('Pollutants', axis = 1)

df_final:
    StationID   a   b   c
0   8809    46  3   12
1   8810    36  5   8
2   8811    NaN 2   7
3   8812    NaN NaN 11
4   8813    82  NaN 15

回答 4

我强烈推荐用下面这种方法来提取 ‘Pollutants’ 列：

df_pollutants = pd.DataFrame(df['Pollutants'].values.tolist(), index=df.index)

它比下面这种写法

df_pollutants = df['Pollutants'].apply(pd.Series)

在 df 规模很大时要快得多。

I strongly recommend this method for extracting the column ‘Pollutants’:

df_pollutants = pd.DataFrame(df['Pollutants'].values.tolist(), index=df.index)

it’s much faster than

df_pollutants = df['Pollutants'].apply(pd.Series)

when the size of df is giant.
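
A rough way to compare the two approaches yourself on synthetic data (IPython %timeit; exact numbers will depend on your machine and pandas version):

import pandas as pd

big = pd.DataFrame({'Pollutants': [{'a': 1}, {'b': 2}, {'a': 3, 'b': 4}] * 10000})

%timeit pd.DataFrame(big['Pollutants'].values.tolist(), index=big.index)
%timeit big['Pollutants'].apply(pd.Series)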


回答 5

你可以将 join 与 pop + tolist 结合使用。性能与 concat 加 drop + tolist 相当，但有些人可能会觉得这种写法更简洁：

res = df.join(pd.DataFrame(df.pop('b').tolist()))

使用其他方法进行基准测试:

df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

def joris1(df):
    return pd.concat([df.drop('b', axis=1), df['b'].apply(pd.Series)], axis=1)

def joris2(df):
    return pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)

def jpp(df):
    return df.join(pd.DataFrame(df.pop('b').tolist()))

df = pd.concat([df]*1000, ignore_index=True)

%timeit joris1(df.copy())  # 1.33 s per loop
%timeit joris2(df.copy())  # 7.42 ms per loop
%timeit jpp(df.copy())     # 7.68 ms per loop

You can use join with pop + tolist. Performance is comparable to concat with drop + tolist, but some may find this syntax cleaner:

res = df.join(pd.DataFrame(df.pop('b').tolist()))

Benchmarking with other methods:

df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

def joris1(df):
    return pd.concat([df.drop('b', axis=1), df['b'].apply(pd.Series)], axis=1)

def joris2(df):
    return pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)

def jpp(df):
    return df.join(pd.DataFrame(df.pop('b').tolist()))

df = pd.concat([df]*1000, ignore_index=True)

%timeit joris1(df.copy())  # 1.33 s per loop
%timeit joris2(df.copy())  # 7.42 ms per loop
%timeit jpp(df.copy())     # 7.68 ms per loop

回答 6

一种解决方案如下:

>>> df = pd.concat([df['Station ID'], df['Pollutants'].apply(pd.Series)], axis=1)
>>> print(df)
   Station ID    a    b   c
0        8809   46    3  12
1        8810   36    5   8
2        8811  NaN    2   7
3        8812  NaN  NaN  11
4        8813   82  NaN  15

A one-line solution is the following:

>>> df = pd.concat([df['Station ID'], df['Pollutants'].apply(pd.Series)], axis=1)
>>> print(df)
   Station ID    a    b   c
0        8809   46    3  12
1        8810   36    5   8
2        8811  NaN    2   7
3        8812  NaN  NaN  11
4        8813   82  NaN  15

回答 7

my_df = pd.DataFrame.from_dict(my_dict, orient='index', columns=['my_col'])

..本可以正确解析字典(将每个字典键放入单独的df列中,并将键值放入df行中),因此这些dict首先不会被压入单个列中。

my_df = pd.DataFrame.from_dict(my_dict, orient='index', columns=['my_col'])

.. would have parsed the dict properly (putting each dict key into a separate df column, and key values into df rows), so the dicts would not get squashed into a single column in the first place.
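
For example, if the data can be fetched as a plain dict keyed by station id (the variable names below are hypothetical), the expanded frame can be built directly:

import pandas as pd

my_dict = {8809: {'a': '46', 'b': '3', 'c': '12'},
           8810: {'a': '36', 'b': '5', 'c': '8'}}

my_df = pd.DataFrame.from_dict(my_dict, orient='index')  # one column per key: a, b, c
my_df.index.name = 'Station ID'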


回答 8

我将这些步骤串联在一个方法中,您只需要传递数据框和包含扩展字典的列即可:

def expand_dataframe(dw: pd.DataFrame, column_to_expand: str) -> pd.DataFrame:
    """
    dw: DataFrame with some column which contain a dict to expand
        in columns
    column_to_expand: String with column name of dw
    """
    import pandas as pd
    from typing import Dict

    def convert_to_dict(sequence: str) -> Dict:
        import json
        s = sequence
        json_acceptable_string = s.replace("'", "\"")
        d = json.loads(json_acceptable_string)
        return d    

    expanded_dataframe = pd.concat([dw.drop([column_to_expand], axis=1),
                                    dw[column_to_expand]
                                    .apply(convert_to_dict)
                                    .apply(pd.Series)],
                                    axis=1)
    return expanded_dataframe

I’ve combined those steps into a method; you only have to pass the dataframe and the column which contains the dict to expand:

def expand_dataframe(dw: pd.DataFrame, column_to_expand: str) -> pd.DataFrame:
    """
    dw: DataFrame with some column which contain a dict to expand
        in columns
    column_to_expand: String with column name of dw
    """
    import pandas as pd
    from typing import Dict

    def convert_to_dict(sequence: str) -> Dict:
        import json
        s = sequence
        json_acceptable_string = s.replace("'", "\"")
        d = json.loads(json_acceptable_string)
        return d    

    expanded_dataframe = pd.concat([dw.drop([column_to_expand], axis=1),
                                    dw[column_to_expand]
                                    .apply(convert_to_dict)
                                    .apply(pd.Series)],
                                    axis=1)
    return expanded_dataframe
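
A hypothetical call, assuming the expand_dataframe function above and an input frame dw whose dicts are stored as single-quoted strings:

import pandas as pd

dw = pd.DataFrame({
    'StationID':  [8809, 8810],
    'Pollutants': ["{'a': '46', 'b': '3'}", "{'c': '11'}"],
})

result = expand_dataframe(dw, 'Pollutants')
# result now has StationID plus one column per dict key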

回答 9

df = pd.concat([df['a'], df.b.apply(pd.Series)], axis=1)
df = pd.concat([df['a'], df.b.apply(pd.Series)], axis=1)

在熊猫MultiIndex DataFrame中选择行

问题:在熊猫MultiIndex DataFrame中选择行

选择/过滤索引为MultiIndex数据框的行的最常见的熊猫方法是什么

  • 根据单个值/标签切片
  • 根据一个或多个级别的多个标签进行切片
  • 过滤布尔条件和表达式
  • 哪种方法在什么情况下适用

为简单起见的假设:

  1. 输入数据框没有重复的索引键
  2. 下面的输入数据框只有两个级别。(此处显示的大多数解决方案一般都适用于N级)

输入示例:

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])

df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15

问题1:选择单个项目

如何选择在“一个”级别中具有“ a”的行?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

另外，我如何在输出中去掉级别 “one”？

     col
two     
t      0
u      1
v      2
w      3

问题1b
如何在级别“ two”上切片所有值为“ t”的行?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

问题2:在一个级别中选择多个值

如何在级别“ one”中选择与项目“ b”和“ d”相对应的行?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

问题2b
我如何获得与“二”级中的“ t”和“ w”相对应的所有值?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

问题3:切片单个横截面 (x, y)

如何检索一个横截面，也就是从 df 中取出索引为某组特定值的单行？具体来说，如何检索 ('c', 'u') 的横截面，结果如下

         col
one two     
c   u      9

问题4:切片多个横截面 [(a, b), (c, d), ...]

如何选择与 ('c', 'u') 和 ('a', 'w') 相对应的两行？

         col
one two     
c   u      9
a   w      3

问题5:每个级别切成一个项目

如何检索与“一级”中的“ a”或“二级”中的“ t”相对应的所有行?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

问题6:任意切片

如何切特定的横截面?对于“ a”和“ b”,我想选择子级别为“ u”和“ v”的所有行,对于“ d”,我想选择子级别为“ w”的行。

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

问题7将使用由数字级别组成的唯一设置:

np.random.seed(0)
mux2 = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    np.random.choice(10, size=16)
], names=['one', 'two'])

df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2)

         col
one two     
a   5      0
    0      1
    3      2
    3      3
b   7      4
    9      5
    3      6
    5      7
    2      8
c   4      9
    7     10
d   6     11
    8     12
    8     13
    1     14
    6     15

问题7:按数字不等式对多索引的各个级别进行过滤

如何获得“二级”中的值大于5的所有行?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

注意：本文不会介绍如何创建 MultiIndex、如何对其执行赋值操作，也不会讨论任何与性能相关的话题（这些是以后另行讨论的主题）。

What are the most common pandas ways to select/filter rows of a dataframe whose index is a MultiIndex?

  • Slicing based on a single value/label
  • Slicing based on multiple labels from one or more levels
  • Filtering on boolean conditions and expressions
  • Which methods are applicable in what circumstances

Assumptions for simplicity:

  1. input dataframe does not have duplicate index keys
  2. input dataframe below only has two levels. (Most solutions shown here generalize to N levels)

Example input:

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])

df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15

Question 1: Selecting a Single Item

How do I select rows having “a” in level “one”?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

Additionally, how would I be able to drop level “one” in the output?

     col
two     
t      0
u      1
v      2
w      3

Question 1b
How do I slice all rows with value “t” on level “two”?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

Question 2: Selecting Multiple Values in a Level

How can I select rows corresponding to items “b” and “d” in level “one”?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

Question 2b
How would I get all values corresponding to “t” and “w” in level “two”?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

Question 3: Slicing a Single Cross Section (x, y)

How do I retrieve a cross section, i.e., a single row having specific values for the index from df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by

         col
one two     
c   u      9

Question 4: Slicing Multiple Cross Sections [(a, b), (c, d), ...]

How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?

         col
one two     
c   u      9
a   w      3

Question 5: One Item Sliced per Level

How can I retrieve all rows corresponding to “a” in level “one” or “t” in level “two”?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

Question 6: Arbitrary Slicing

How can I slice specific cross sections? For “a” and “b”, I would like to select all rows with sub-levels “u” and “v”, and for “d”, I would like to select rows with sub-level “w”.

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

Question 7 will use a unique setup consisting of a numeric level:

np.random.seed(0)
mux2 = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    np.random.choice(10, size=16)
], names=['one', 'two'])

df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2)

         col
one two     
a   5      0
    0      1
    3      2
    3      3
b   7      4
    9      5
    3      6
    5      7
    2      8
c   4      9
    7     10
d   6     11
    8     12
    8     13
    1     14
    6     15

Question 7: Filtering by numeric inequality on individual levels of the multiindex

How do I get all rows where values in level “two” are greater than 5?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

Note: This post will not go through how to create MultiIndexes, how to perform assignment operations on them, or any performance related discussions (these are separate topics for another time).


回答 0

多索引/高级索引

注意:
此帖子的结构如下:

  1. 操作规范中提出的问题将一一解决
  2. 对于每个问题,将演示一种或多种适用于解决该问题并获得预期结果的方法。

注释（与这一条类似）会提供给有兴趣了解更多功能、实现细节以及与当前主题相关的其他信息的读者。这些注释是通过查阅文档、挖掘各种不太显眼的功能，并结合我自己（诚然有限）的经验整理而成的。

所有代码示例均已在pandas v0.23.4,python3.7上创建并测试。如果有不清楚的地方,或者实际上是不正确的,或者您找不到适用于您的用例的解决方案,请随时提出修改建议,在注释中要求澄清,或打开一个新问题…… 。

这是对一些常见习语(以下称为“四个习语”)的介绍,我们将经常对其进行复习。

  1. DataFrame.loc-按标签选择的一般解决方案(+ pd.IndexSlice表示涉及切片的更复杂应用)

  2. DataFrame.xs -从Series / DataFrame中提取特定的横截面。

  3. DataFrame.query - 动态指定切片和/或过滤操作（即作为动态求值的表达式）。它在某些场景下比其他方法更适用。另请参阅文档中关于在 MultiIndex 上使用 query 的部分。

  4. 使用由 MultiIndex.get_level_values 生成的掩码进行布尔索引（通常与 Index.isin 结合使用，尤其是在用多个值过滤时）。在某些情况下这也非常有用。

从四个习语的角度来看各种切片和过滤问题,将有助于更好地理解可以应用于给定情况的内容,这将是有益的。非常重要的一点是要了解,并非所有习惯用法在每种情况下都一样有效(如果有的话)。如果没有将成语列为以下问题的潜在解决方案,则意味着该成语不能有效地应用于该问题。


问题1

如何选择在“一个”级别中具有“ a”的行?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

您可以将loc用作适用于大多数情况的通用解决方案:

df.loc[['a']]

此时,如果您得到

TypeError: Expected tuple, got str

这意味着您使用的是旧版熊猫。考虑升级!否则,请使用df.loc[('a', slice(None)), :]

或者,您可以xs在这里使用,因为我们要提取单个横截面。请注意levelsaxis参数(此处可以采用合理的默认值)。

df.xs('a', level=0, axis=0, drop_level=False)
# df.xs('a', drop_level=False)

在这里，需要 drop_level=False 参数来防止 xs 在结果中去掉级别 “one”（即我们切片所依据的级别）。

这里的另一个选择是使用query

df.query("one == 'a'")

如果索引没有名称,则需要将查询字符串更改为"ilevel_0 == 'a'"

最后,使用get_level_values

df[df.index.get_level_values('one') == 'a']
# If your levels are unnamed, or if you need to select by position (not label),
# df[df.index.get_level_values(0) == 'a']

另外，我如何在输出中去掉级别 “one”？

     col
two     
t      0
u      1
v      2
w      3

使用以下任一方法均可轻松完成此操作

df.loc['a'] # Notice the single string argument instead of the list.

要么,

df.xs('a', level=0, axis=0, drop_level=True)
# df.xs('a')

注意，我们可以省略 drop_level 参数（它默认为 True）。

注意
您可能会注意到,经过过滤的DataFrame可能仍然具有所有级别,即使在打印出DataFrame时不显示它们也是如此。例如,

v = df.loc[['a']]
print(v)
         col
one two     
a   t      0
    u      1
    v      2
    w      3

print(v.index)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

您可以使用 MultiIndex.remove_unused_levels 去掉这些未使用的级别：

v.index = v.index.remove_unused_levels()

print(v.index)
MultiIndex(levels=[['a'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

问题1b

如何在级别“ two”上切片所有值为“ t”的行?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

直观地讲，您会想到使用 slice()：

df.loc[(slice(None), 't'), :]

它确实行得通！™ 但是很笨拙。我们可以在这里使用 pd.IndexSlice API，让切片语法更自然。

idx = pd.IndexSlice
df.loc[idx[:, 't'], :]

这要干净得多。

注意
为什么:需要跨列的尾随切片?这是因为loc可以用于沿两个轴(axis=0axis=1)进行选择和切片。没有明确说明要在哪个轴上进行切片,操作将变得模棱两可。请参阅切片文档中的红色大框。

如果想彻底消除歧义，loc 还接受一个 axis 参数：

df.loc(axis=0)[pd.IndexSlice[:, 't']]

如果没有该 axis 参数（即仅执行 df.loc[pd.IndexSlice[:, 't']]），则会假定切片是在列上进行的，这种情况下会引发 KeyError。

切片器中对此进行了记录。但是,出于本文的目的,我们将明确指定所有轴。

xs,它是

df.xs('t', axis=0, level=1, drop_level=False)

query,它是

df.query("two == 't'")
# Or, if the first level has no name, 
# df.query("ilevel_1 == 't'") 

最后,通过get_level_values,您可以

df[df.index.get_level_values('two') == 't']
# Or, to perform selection by position/integer,
# df[df.index.get_level_values(1) == 't']

全部达到相同的效果。


问题2

如何在级别“ one”中选择与项目“ b”和“ d”相对应的行?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

使用loc,可以通过指定列表以类似方式完成此操作。

df.loc[['b', 'd']]

要解决选择“ b”和“ d”的上述问题,您还可以使用query

items = ['b', 'd']
df.query("one in @items")
# df.query("one == @items", parser='pandas')
# df.query("one in ['b', 'd']")
# df.query("one == ['b', 'd']", parser='pandas')

注意:
是的,默认解析器为'pandas',但重要的是要突出此语法不是python。Pandas解析器从表达式生成的解析树略有不同。这样做是为了使某些操作更直观地指定。有关更多信息,请阅读我关于 使用pd.eval()在熊猫中进行动态表达评估的文章

并且,用get_level_values+ Index.isin

df[df.index.get_level_values("one").isin(['b', 'd'])]

问题2b

我如何获得与“第二”级别中的“ t”和“ w”相对应的所有值?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

使用 loc 时，只有与 pd.IndexSlice 结合使用才能做到。

df.loc[pd.IndexSlice[:, ['t', 'w']], :] 

pd.IndexSlice[:, ['t', 'w']] 中的第一个冒号 : 表示对第一个级别做整体切片。被查询的级别越深，需要指定的切片就越多，每个被切片的级别各对应一个。不过，超出所切片级别之外的更深层级无需再指定。

使用query,这是

items = ['t', 'w']
df.query("two in @items")
# df.query("two == @items", parser='pandas') 
# df.query("two in ['t', 'w']")
# df.query("two == ['t', 'w']", parser='pandas')

get_level_valuesIndex.isin(类似于上面):

df[df.index.get_level_values('two').isin(['t', 'w'])]

问题3

如何检索一个横截面，也就是从 df 中取出索引为某组特定值的单行？具体来说，如何检索 ('c', 'u') 的横截面，结果如下

         col
one two     
c   u      9

loc通过指定键元组来使用:

df.loc[('c', 'u'), :]

要么,

df.loc[pd.IndexSlice[('c', 'u')]]

注意
此时,您可能会遇到PerformanceWarning如下所示的:

PerformanceWarning: indexing past lexsort depth may impact performance.

这只是表示您的索引未排序。pandas 依赖已排序的索引（在这种情况下是按字典序排序，因为我们处理的是字符串值）来实现最优的搜索和检索。一个快速的解决办法是预先用 DataFrame.sort_index 对 DataFrame 排序。如果您打算连续执行多个这样的查询，从性能角度看这样做尤其值得：

df_sort = df.sort_index()
df_sort.loc[('c', 'u')]

您还可以用 MultiIndex.is_lexsorted() 检查索引是否已排序，它会相应地返回 True 或 False。您可以调用它来判断是否需要额外的排序步骤。

使用xs,这再次简单地传递了一个元组作为第一个参数,而所有其他参数都设置为其适当的默认值:

df.xs(('c', 'u'))

使用query,事情变得有些笨拙:

df.query("one == 'c' and two == 'u'")

您现在可以看到,这将很难一概而论。但是对于这个特定问题仍然可以。

跨多个级别的访问get_level_values仍然可以使用,但不建议这样做:

m1 = (df.index.get_level_values('one') == 'c')
m2 = (df.index.get_level_values('two') == 'u')
df[m1 & m2]

问题4

如何选择与 ('c', 'u') 和 ('a', 'w') 相对应的两行？

         col
one two     
c   u      9
a   w      3

使用loc,这仍然很简单:

df.loc[[('c', 'u'), ('a', 'w')]]
# df.loc[pd.IndexSlice[[('c', 'u'), ('a', 'w')]]]

使用query,您将需要通过遍历横截面和层次来动态生成查询字符串:

cses = [('c', 'u'), ('a', 'w')]
levels = ['one', 'two']
# This is a useful check to make in advance.
assert all(len(levels) == len(cs) for cs in cses) 

query = '(' + ') or ('.join([
    ' and '.join([f"({l} == {repr(c)})" for l, c in zip(levels, cs)]) 
    for cs in cses
]) + ')'

print(query)
# ((one == 'c') and (two == 'u')) or ((one == 'a') and (two == 'w'))

df.query(query)

100%不推荐!但是有可能。


问题5

如何检索与“一级”中的“ a”或“二级”中的“ t”相对应的所有行?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

要在保证正确性的同时保持代码清晰，用 loc 实际上很难做到这一点。df.loc[pd.IndexSlice['a', 't']] 是不正确的，它会被解释为 df.loc[pd.IndexSlice[('a', 't')]]（即选择一个横截面）。您可能会想到用 pd.concat 分别处理每个标签的解决方案：

pd.concat([
    df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])

         col
one two     
a   t      0
    u      1
    v      2
    w      3
    t      0   # Does this look right to you? No, it isn't!
b   t      4
    t      8
d   t     12

但是您会注意到其中一行是重复的。这是因为该行同时满足两个切片条件,因此出现了两次。您将需要做

v = pd.concat([
        df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])
v[~v.index.duplicated()]

但是，如果您的 DataFrame 本身就包含（您想保留的）重复索引，那么这种做法不会保留它们。使用时要格外小心。

使用query,这非常简单:

df.query("one == 'a' or two == 't'")

使用get_level_values,这仍然很简单,但并不那么优雅:

m1 = (df.index.get_level_values('one') == 'a')
m2 = (df.index.get_level_values('two') == 't')
df[m1 | m2] 

问题6

如何切特定的横截面?对于“ a”和“ b”,我想选择子级别为“ u”和“ v”的所有行,对于“ d”,我想选择子级别为“ w”的行。

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

我添加了这是一个特殊情况,以帮助理解四个惯用语的用法-这是其中的任何一个都无法有效工作的情况,因为切片非常具体,并且没有遵循任何实际模式。

通常,像这样的切片问题将需要将键列表显式传递给loc。一种方法是:

keys = [('a', 'u'), ('a', 'v'), ('b', 'u'), ('b', 'v'), ('d', 'w')]
df.loc[keys, :]

如果想少敲一些代码，您会发现切片 “a”、“b” 及其子级别存在一个共同模式，因此可以把切片任务拆成两部分，再用 concat 合并结果：

pd.concat([
     df.loc[(('a', 'b'), ('u', 'v')), :], 
     df.loc[('d', 'w'), :]
   ], axis=0)

“a” 和 “b” 的切片写法 (('a', 'b'), ('u', 'v')) 之所以稍微简洁一些，是因为两者要索引的子级别是相同的。


问题7

如何获得“二级”中的值大于5的所有行?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

这可以用 query 做到：

df2.query("two > 5")

get_level_values

df2[df2.index.get_level_values('two') > 5]

注意
与此示例类似，我们可以用这些构造基于任意条件进行过滤。一般来说，要记住：loc 和 xs 专门用于基于标签的索引，而 query 和 get_level_values 则有助于构建用于过滤的通用条件掩码。


奖金问题

如果我需要对 MultiIndex 列进行切片怎么办？

实际上，此处的大多数解决方案只需稍作修改也同样适用于列。考虑：

np.random.seed(0)
mux3 = pd.MultiIndex.from_product([
        list('ABCD'), list('efgh')
], names=['one','two'])

df3 = pd.DataFrame(np.random.choice(10, (3, len(mux3))), columns=mux3)
print(df3)

one  A           B           C           D         
two  e  f  g  h  e  f  g  h  e  f  g  h  e  f  g  h
0    5  0  3  3  7  9  3  5  2  4  7  6  8  8  1  6
1    7  7  8  1  5  9  8  9  4  3  0  3  5  0  2  3
2    8  1  3  3  3  7  0  1  9  9  0  4  7  3  2  7

要让四个惯用法适用于列，您需要做以下更改。

  1. 要切片loc,请使用

    df3.loc[:, ....] # Notice how we slice across the index with `:`. 

    要么,

    df3.loc[:, pd.IndexSlice[...]]
  2. xs适当使用,只需传递一个参数axis=1

  3. 您可以用 df.columns.get_level_values 直接访问列级别的值。然后需要写成类似这样的形式

    df.loc[:, {condition}] 

    其中 {condition} 代表用 columns.get_level_values 构建的某个条件。

  4. 要使用query,您唯一的选择是转置,查询索引并再次转置:

    df3.T.query(...).T

    不推荐这种做法，请改用其他 3 个选项之一。

MultiIndex / Advanced Indexing

Note
This post will be structured in the following manner:

  1. The questions put forth in the OP will be addressed, one by one
  2. For each question, one or more methods applicable to solving this problem and getting the expected result will be demonstrated.

Notes (much like this one) will be included for readers interested in learning about additional functionality, implementation details, and other information peripheral to the topic at hand. These notes have been compiled through scouring the docs and uncovering various obscure features, and from my own (admittedly limited) experience.

All code samples have been created and tested on pandas v0.23.4, python 3.7. If something is not clear, or factually incorrect, or if you did not find a solution applicable to your use case, please feel free to suggest an edit, request clarification in the comments, or open a new question, as applicable.

Here is an introduction to some common idioms (henceforth referred to as the Four Idioms) we will be frequently revisiting:

  1. DataFrame.loc – A general solution for selection by label (+ pd.IndexSlice for more complex applications involving slices)

  2. DataFrame.xs – Extract a particular cross section from a Series/DataFrame.

  3. DataFrame.query – Specify slicing and/or filtering operations dynamically (i.e., as an expression that is evaluated dynamically). It is more applicable to some scenarios than others. Also see this section of the docs for querying on MultiIndexes.

  4. Boolean indexing with a mask generated using MultiIndex.get_level_values (often in conjunction with Index.isin, especially when filtering with multiple values). This is also quite useful in some circumstances.

It will be beneficial to look at the various slicing and filtering problems in terms of the Four Idioms to gain a better understanding of what can be applied to a given situation. It is very important to understand that not all of the idioms will work equally well (if at all) in every circumstance. If an idiom has not been listed as a potential solution to a problem below, that means that idiom cannot be applied to that problem effectively.


Question 1

How do I select rows having “a” in level “one”?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

You can use loc, as a general purpose solution applicable to most situations:

df.loc[['a']]

At this point, if you get

TypeError: Expected tuple, got str

That means you’re using an older version of pandas. Consider upgrading! Otherwise, use df.loc[('a', slice(None)), :].

Alternatively, you can use xs here, since we are extracting a single cross section. Note the levels and axis arguments (reasonable defaults can be assumed here).

df.xs('a', level=0, axis=0, drop_level=False)
# df.xs('a', drop_level=False)

Here, the drop_level=False argument is needed to prevent xs from dropping level “one” in the result (the level we sliced on).

Yet another option here is using query:

df.query("one == 'a'")

If the index did not have a name, you would need to change your query string to be "ilevel_0 == 'a'".

Finally, using get_level_values:

df[df.index.get_level_values('one') == 'a']
# If your levels are unnamed, or if you need to select by position (not label),
# df[df.index.get_level_values(0) == 'a']

Additionally, how would I be able to drop level “one” in the output?

     col
two     
t      0
u      1
v      2
w      3

This can be easily done using either

df.loc['a'] # Notice the single string argument instead of the list.

Or,

df.xs('a', level=0, axis=0, drop_level=True)
# df.xs('a')

Notice that we can omit the drop_level argument (it is assumed to be True by default).

Note
You may notice that a filtered DataFrame may still have all the levels, even if they do not show when printing the DataFrame out. For example,

v = df.loc[['a']]
print(v)
         col
one two     
a   t      0
    u      1
    v      2
    w      3

print(v.index)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

You can get rid of these levels using MultiIndex.remove_unused_levels:

v.index = v.index.remove_unused_levels()
print(v.index)
MultiIndex(levels=[['a'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

Question 1b

How do I slice all rows with value “t” on level “two”?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

Intuitively, you would want something involving slice():

df.loc[(slice(None), 't'), :]

It Just Works!™ But it is clunky. We can facilitate a more natural slicing syntax using the pd.IndexSlice API here.

idx = pd.IndexSlice
df.loc[idx[:, 't'], :]

This is much, much cleaner.

Note
Why is the trailing slice : across the columns required? This is because, loc can be used to select and slice along both axes (axis=0 or axis=1). Without explicitly making it clear which axis the slicing is to be done on, the operation becomes ambiguous. See the big red box in the documentation on slicing.

If you want to remove any shade of ambiguity, loc accepts an axis parameter:

df.loc(axis=0)[pd.IndexSlice[:, 't']]

Without the axis parameter (i.e., just by doing df.loc[pd.IndexSlice[:, 't']]), slicing is assumed to be on the columns, and a KeyError will be raised in this circumstance.

This is documented in slicers. For the purpose of this post, however, we will explicitly specify all axes.

With xs, it is

df.xs('t', axis=0, level=1, drop_level=False)

With query, it is

df.query("two == 't'")
# Or, if the first level has no name, 
# df.query("ilevel_1 == 't'") 

And finally, with get_level_values, you may do

df[df.index.get_level_values('two') == 't']
# Or, to perform selection by position/integer,
# df[df.index.get_level_values(1) == 't']

All to the same effect.


Question 2

How can I select rows corresponding to items “b” and “d” in level “one”?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

Using loc, this is done in a similar fashion by specifying a list.

df.loc[['b', 'd']]

To solve the above problem of selecting “b” and “d”, you can also use query:

items = ['b', 'd']
df.query("one in @items")
# df.query("one == @items", parser='pandas')
# df.query("one in ['b', 'd']")
# df.query("one == ['b', 'd']", parser='pandas')

Note
Yes, the default parser is 'pandas', but it is important to highlight that this syntax isn’t conventional Python. The Pandas parser generates a slightly different parse tree from the expression. This is done to make some operations more intuitive to specify. For more information, please read my post on Dynamic Expression Evaluation in pandas using pd.eval().

And, with get_level_values + Index.isin:

df[df.index.get_level_values("one").isin(['b', 'd'])]

Question 2b

How would I get all values corresponding to “t” and “w” in level “two”?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

With loc, this is possible only in conjunction with pd.IndexSlice.

df.loc[pd.IndexSlice[:, ['t', 'w']], :] 

The first colon : in pd.IndexSlice[:, ['t', 'w']] means to slice across the first level. As the depth of the level being queried increases, you will need to specify more slices, one per level being sliced across. You will not need to specify more levels beyond the one being sliced, however.

With query, this is

items = ['t', 'w']
df.query("two in @items")
# df.query("two == @items", parser='pandas') 
# df.query("two in ['t', 'w']")
# df.query("two == ['t', 'w']", parser='pandas')

With get_level_values and Index.isin (similar to above):

df[df.index.get_level_values('two').isin(['t', 'w'])]

Question 3

How do I retrieve a cross section, i.e., a single row having specific values for the index from df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by

         col
one two     
c   u      9

Use loc by specifying a tuple of keys:

df.loc[('c', 'u'), :]

Or,

df.loc[pd.IndexSlice[('c', 'u')]]

Note
At this point, you may run into a PerformanceWarning that looks like this:

PerformanceWarning: indexing past lexsort depth may impact performance.

This just means that your index is not sorted. pandas depends on the index being sorted (in this case, lexicographically, since we are dealing with string values) for optimal search and retrieval. A quick fix would be to sort your DataFrame in advance using DataFrame.sort_index. This is especially desirable from a performance standpoint if you plan on doing multiple such queries in tandem:

df_sort = df.sort_index()
df_sort.loc[('c', 'u')]

You can also use MultiIndex.is_lexsorted() to check whether the index is sorted or not. This function returns True or False accordingly. You can call this function to determine whether an additional sorting step is required or not.

With xs, this is again simply passing a single tuple as the first argument, with all other arguments set to their appropriate defaults:

df.xs(('c', 'u'))

With query, things become a bit clunky:

df.query("one == 'c' and two == 'u'")

You can see now that this is going to be relatively difficult to generalize. But it is still OK for this particular problem.

With accesses spanning multiple levels, get_level_values can still be used, but is not recommended:

m1 = (df.index.get_level_values('one') == 'c')
m2 = (df.index.get_level_values('two') == 'u')
df[m1 & m2]

Question 4

How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?

         col
one two     
c   u      9
a   w      3

With loc, this is still as simple as:

df.loc[[('c', 'u'), ('a', 'w')]]
# df.loc[pd.IndexSlice[[('c', 'u'), ('a', 'w')]]]

With query, you will need to dynamically generate a query string by iterating over your cross sections and levels:

cses = [('c', 'u'), ('a', 'w')]
levels = ['one', 'two']
# This is a useful check to make in advance.
assert all(len(levels) == len(cs) for cs in cses) 

query = '(' + ') or ('.join([
    ' and '.join([f"({l} == {repr(c)})" for l, c in zip(levels, cs)]) 
    for cs in cses
]) + ')'

print(query)
# ((one == 'c') and (two == 'u')) or ((one == 'a') and (two == 'w'))

df.query(query)

100% DO NOT RECOMMEND! But it is possible.

What if I have multiple levels?
One option in this scenario would be to use droplevel to drop the levels you’re not checking, then use isin to test membership, and then boolean index on the final result.

df[df.index.droplevel(unused_level).isin([('c', 'u'), ('a', 'w')])]

Question 5

How can I retrieve all rows corresponding to “a” in level “one” or “t” in level “two”?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

This is actually very difficult to do with loc while ensuring correctness and still maintaining code clarity. df.loc[pd.IndexSlice['a', 't']] is incorrect; it is interpreted as df.loc[pd.IndexSlice[('a', 't')]] (i.e., selecting a cross section). You may think of a solution with pd.concat to handle each label separately:

pd.concat([
    df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])

         col
one two     
a   t      0
    u      1
    v      2
    w      3
    t      0   # Does this look right to you? No, it isn't!
b   t      4
    t      8
d   t     12

But you’ll notice one of the rows is duplicated. This is because that row satisfied both slicing conditions, and so appeared twice. You will instead need to do

v = pd.concat([
        df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])
v[~v.index.duplicated()]

But if your DataFrame inherently contains duplicate indices (that you want), then this will not retain them. Use with extreme caution.

With query, this is stupidly simple:

df.query("one == 'a' or two == 't'")

With get_level_values, this is still simple, but not as elegant:

m1 = (df.index.get_level_values('one') == 'a')
m2 = (df.index.get_level_values('two') == 't')
df[m1 | m2] 

Question 6

How can I slice specific cross sections? For “a” and “b”, I would like to select all rows with sub-levels “u” and “v”, and for “d”, I would like to select rows with sub-level “w”.

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

This is a special case that I’ve added to help understand the applicability of the Four Idioms—this is one case where none of them will work effectively, since the slicing is very specific, and does not follow any real pattern.

Usually, slicing problems like this will require explicitly passing a list of keys to loc. One way of doing this is with:

keys = [('a', 'u'), ('a', 'v'), ('b', 'u'), ('b', 'v'), ('d', 'w')]
df.loc[keys, :]

If you want to save some typing, you will recognise that there is a pattern to slicing “a”, “b” and its sublevels, so we can separate the slicing task into two portions and concat the result:

pd.concat([
     df.loc[(('a', 'b'), ('u', 'v')), :], 
     df.loc[('d', 'w'), :]
   ], axis=0)

The slicing specification for “a” and “b”, (('a', 'b'), ('u', 'v')), is slightly cleaner because the sub-levels being indexed are the same for both labels.


Question 7

How do I get all rows where values in level “two” are greater than 5?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

This can be done using query,

df2.query("two > 5")

And get_level_values.

df2[df2.index.get_level_values('two') > 5]

Note
Similar to this example, we can filter based on any arbitrary condition using these constructs. In general, it is useful to remember that loc and xs are specifically for label-based indexing, while query and get_level_values are helpful for building general conditional masks for filtering.


Bonus Question

What if I need to slice a MultiIndex column?

Actually, most solutions here are applicable to columns as well, with minor changes. Consider:

np.random.seed(0)
mux3 = pd.MultiIndex.from_product([
        list('ABCD'), list('efgh')
], names=['one','two'])

df3 = pd.DataFrame(np.random.choice(10, (3, len(mux3))), columns=mux3)
print(df3)

one  A           B           C           D         
two  e  f  g  h  e  f  g  h  e  f  g  h  e  f  g  h
0    5  0  3  3  7  9  3  5  2  4  7  6  8  8  1  6
1    7  7  8  1  5  9  8  9  4  3  0  3  5  0  2  3
2    8  1  3  3  3  7  0  1  9  9  0  4  7  3  2  7

These are the changes you will need to make to the Four Idioms to have them work with columns; a short combined sketch follows the list.

  1. To slice with loc, use

     df3.loc[:, ....] # Notice how we slice across the index with `:`. 
    

    or,

     df3.loc[:, pd.IndexSlice[...]]
    
  2. To use xs as appropriate, just pass an argument axis=1.

  3. You can access the column level values directly using df.columns.get_level_values. You will then need to do something like

     df.loc[:, {condition}] 
    

    Where {condition} represents some condition built using columns.get_level_values.

  4. To use query, your only option is to transpose, query on the index, and transpose again:

     df3.T.query(...).T
    

    Not recommended, use one of the other 3 options.
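
For example, a minimal sketch of idioms 1, 2 and 4 applied to the columns of df3 defined above:

# Idiom 1: loc + pd.IndexSlice on the columns axis
df3.loc[:, pd.IndexSlice[['A', 'B'], 'e']]

# Idiom 2: xs on the columns axis
df3.xs('e', axis=1, level='two')

# Idiom 4: boolean mask built from the column level values
df3.loc[:, df3.columns.get_level_values('two') == 'f']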


回答 1

最近，我遇到一个用例：有一个 3 级以上的多级索引数据框，我用上面的任何解决方案都没能得到想要的结果。上面的方案很可能确实适用于我的用例，我也尝试了几种，但在我可用的时间内没能让它们跑通。

我距离专家还很远,但是我偶然发现了上面的综合答案中未列出的解决方案。我不保证解决方案无论如何都是最佳的。

这是获得与上面的问题6稍有不同的结果的不同方法。(以及其他可能的问题)

我特别在寻找:

  1. 一种从一个索引级别选择两个以上的值,从另一个索引级别选择一个值的方法,以及
  2. 在数据帧输出中保留上一操作的索引值的方法。

还有一个额外的麻烦（不过完全可以解决）：

  1. 索引未命名。

在下面的玩具数据帧上:

    index = pd.MultiIndex.from_product([['a','b'],
                               ['stock1','stock2','stock3'],
                               ['price','volume','velocity']])

    df = pd.DataFrame([1,2,3,4,5,6,7,8,9,
                      10,11,12,13,14,15,16,17,18], 
                       index)

                        0
    a stock1 price      1
             volume     2
             velocity   3
      stock2 price      4
             volume     5
             velocity   6
      stock3 price      7
             volume     8
             velocity   9
    b stock1 price     10
             volume    11
             velocity  12
      stock2 price     13
             volume    14
             velocity  15
      stock3 price     16
             volume    17
             velocity  18

当然,使用以下作品:

    df.xs(('stock1', 'velocity'), level=(1,2))

        0
    a   3
    b  12

但是我想要一个不同的结果,所以获得该结果的方法是:

   df.iloc[df.index.isin(['stock1'], level=1) & 
           df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
    b stock1 velocity  12

如果我想从一个级别获得两个以上的值,并从另一个级别获得一个(或2个以上)值:

    df.iloc[df.index.isin(['stock1','stock3'], level=1) & 
            df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
      stock3 velocity   9
    b stock1 velocity  12
      stock3 velocity  18

上面的方法可能有点笨拙，但我发现它满足了我的需求，而且额外的好处是对我来说更容易理解和阅读。

Recently I came across a use case where I had a 3+ level multi-index dataframe in which I couldn’t make any of the solutions above produce the results I was looking for. It’s quite possible that the above solutions do work for my use case; I tried several, but I was unable to get them to work in the time I had available.

I am far from expert, but I stumbled across a solution that was not listed in the comprehensive answers above. I offer no guarantee that the solutions are in any way optimal.

This is a different way to get a slightly different result to Question #6 above. (and likely other questions as well)

Specifically I was looking for:

  1. A way to choose two+ values from one level of the index and a single value from another level of the index, and
  2. A way to leave the index values from the previous operation in the dataframe output.

As a monkey wrench in the gears (however totally fixable):

  1. The indexes were unnamed.

On the toy dataframe below:

    index = pd.MultiIndex.from_product([['a','b'],
                               ['stock1','stock2','stock3'],
                               ['price','volume','velocity']])

    df = pd.DataFrame([1,2,3,4,5,6,7,8,9,
                      10,11,12,13,14,15,16,17,18], 
                       index)

                        0
    a stock1 price      1
             volume     2
             velocity   3
      stock2 price      4
             volume     5
             velocity   6
      stock3 price      7
             volume     8
             velocity   9
    b stock1 price     10
             volume    11
             velocity  12
      stock2 price     13
             volume    14
             velocity  15
      stock3 price     16
             volume    17
             velocity  18

Using the below works, of course:

    df.xs(('stock1', 'velocity'), level=(1,2))

        0
    a   3
    b  12

But I wanted a different result, so my method to get that result was:

   df.iloc[df.index.isin(['stock1'], level=1) & 
           df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
    b stock1 velocity  12

And if I wanted two+ values from one level and a single (or 2+) value from another level:

    df.iloc[df.index.isin(['stock1','stock3'], level=1) & 
            df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
      stock3 velocity   9
    b stock1 velocity  12
      stock3 velocity  18

The above method is probably a bit clunky; however, I found that it filled my needs and, as a bonus, was easier for me to understand and read.


使用Pandas在python中读取Excel文件

问题:使用Pandas在python中读取Excel文件

我正在尝试以这种方式读取Excel文件:

newFile = pd.ExcelFile(PATH\FileName.xlsx)
ParsedData = pd.io.parsers.ExcelFile.parse(newFile)

这引发了一个错误,该错误表示预期有两个参数,我不知道第二个参数是什么,而且我在这里想要实现的是将Excel文件转换为DataFrame,我这样做是否正确?或者还有其他方法可以使用熊猫吗?

I am trying to read an excel file this way :

newFile = pd.ExcelFile(PATH\FileName.xlsx)
ParsedData = pd.io.parsers.ExcelFile.parse(newFile)

which throws an error saying that two arguments are expected. I don’t know what the second argument is. What I am trying to achieve here is to convert an Excel file to a DataFrame. Am I doing it the right way, or is there another way to do this using pandas?


回答 0

已经很接近了：先调用 ExcelFile，然后调用 .parse 方法并把工作表名称传给它。

>>> xl = pd.ExcelFile("dummydata.xlsx")
>>> xl.sheet_names
[u'Sheet1', u'Sheet2', u'Sheet3']
>>> df = xl.parse("Sheet1")
>>> df.head()
                  Tid  dummy1    dummy2    dummy3    dummy4    dummy5  \
0 2006-09-01 00:00:00       0  5.894611  0.605211  3.842871  8.265307   
1 2006-09-01 01:00:00       0  5.712107  0.605211  3.416617  8.301360   
2 2006-09-01 02:00:00       0  5.105300  0.605211  3.090865  8.335395   
3 2006-09-01 03:00:00       0  4.098209  0.605211  3.198452  8.170187   
4 2006-09-01 04:00:00       0  3.338196  0.605211  2.970015  7.765058   

     dummy6  dummy7    dummy8    dummy9  
0  0.623354       0  2.579108  2.681728  
1  0.554211       0  7.210000  3.028614  
2  0.567841       0  6.940000  3.644147  
3  0.581470       0  6.630000  4.016155  
4  0.595100       0  6.350000  3.974442  

您正在做的是调用驻留在类本身而不是实例上的方法,这是可以的(尽管不是很惯用),但是如果这样做,则还需要传递工作表名称:

>>> parsed = pd.io.parsers.ExcelFile.parse(xl, "Sheet1")
>>> parsed.columns
Index([u'Tid', u'dummy1', u'dummy2', u'dummy3', u'dummy4', u'dummy5', u'dummy6', u'dummy7', u'dummy8', u'dummy9'], dtype=object)

Close: first you call ExcelFile, but then you call the .parse method and pass it the sheet name.

>>> xl = pd.ExcelFile("dummydata.xlsx")
>>> xl.sheet_names
[u'Sheet1', u'Sheet2', u'Sheet3']
>>> df = xl.parse("Sheet1")
>>> df.head()
                  Tid  dummy1    dummy2    dummy3    dummy4    dummy5  \
0 2006-09-01 00:00:00       0  5.894611  0.605211  3.842871  8.265307   
1 2006-09-01 01:00:00       0  5.712107  0.605211  3.416617  8.301360   
2 2006-09-01 02:00:00       0  5.105300  0.605211  3.090865  8.335395   
3 2006-09-01 03:00:00       0  4.098209  0.605211  3.198452  8.170187   
4 2006-09-01 04:00:00       0  3.338196  0.605211  2.970015  7.765058   

     dummy6  dummy7    dummy8    dummy9  
0  0.623354       0  2.579108  2.681728  
1  0.554211       0  7.210000  3.028614  
2  0.567841       0  6.940000  3.644147  
3  0.581470       0  6.630000  4.016155  
4  0.595100       0  6.350000  3.974442  

What you’re doing is calling the method which lives on the class itself, rather than the instance, which is okay (although not very idiomatic), but if you’re doing that you would also need to pass the sheet name:

>>> parsed = pd.io.parsers.ExcelFile.parse(xl, "Sheet1")
>>> parsed.columns
Index([u'Tid', u'dummy1', u'dummy2', u'dummy3', u'dummy4', u'dummy5', u'dummy6', u'dummy7', u'dummy8', u'dummy9'], dtype=object)
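
In newer pandas versions the same can be done with a single read_excel call; sheet_name=None reads every sheet into a dict keyed by sheet name. A minimal sketch using the same file name:

import pandas as pd

df = pd.read_excel("dummydata.xlsx", sheet_name="Sheet1")

all_sheets = pd.read_excel("dummydata.xlsx", sheet_name=None)  # dict of DataFrames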

回答 1

这是非常简单的方法。

import pandas
df = pandas.read_excel(open('your_xls_xlsx_filename','rb'), sheetname='Sheet 1')
# or using sheet index starting 0
df = pandas.read_excel(open('your_xls_xlsx_filename','rb'), sheetname=2)

查看文档了解完整详细信息：http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.read_excel.html

FutureWarning：在较新的 Pandas 版本中，sheetname 关键字已被弃用，请改用 sheet_name。

This is a much simpler and easier way.

import pandas
df = pandas.read_excel(open('your_xls_xlsx_filename','rb'), sheetname='Sheet 1')
# or using sheet index starting 0
df = pandas.read_excel(open('your_xls_xlsx_filename','rb'), sheetname=2)

check out documentation full details http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.read_excel.html

FutureWarning: The sheetname keyword is deprecated for newer Pandas versions, use sheet_name instead.
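
An updated version of the same snippet for recent pandas versions (sheet_name instead of sheetname; passing the path directly avoids managing the file handle yourself):

import pandas as pd

df = pd.read_excel('your_xls_xlsx_filename', sheet_name='Sheet 1')
# or by zero-based sheet index
df = pd.read_excel('your_xls_xlsx_filename', sheet_name=2)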


回答 2

我想应该在这里补充一下：如果要访问行或列并遍历它们，可以这样做：

import pandas as pd

# open the file
xlsx = pd.ExcelFile(r"PATH\FileName.xlsx")

# get the first sheet as an object
sheet1 = xlsx.parse(0)

# get the first column as a list you can loop through
# change the 0 below to the row or column number you want
column = sheet1.icol(0).real

# get the first row as a list you can loop through
row = sheet1.irow(0).real

编辑:

方法 icol(i) 和 irow(i) 现已弃用。您可以使用 sheet1.iloc[:,i] 获取第 i 列,使用 sheet1.iloc[i,:] 获取第 i 行。

Thought I should add here that if you want to access rows or columns to loop through them, you can do this:

import pandas as pd

# open the file
xlsx = pd.ExcelFile(r"PATH\FileName.xlsx")

# get the first sheet as an object
sheet1 = xlsx.parse(0)

# get the first column as a list you can loop through
# change the 0 below to the row or column number you want
column = sheet1.icol(0).real

# get the first row as a list you can loop through
row = sheet1.irow(0).real

Edit:

The methods icol(i) and irow(i) are deprecated now. You can use sheet1.iloc[:,i] to get the i-th col and sheet1.iloc[i,:] to get the i-th row.
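
A small sketch of that iloc replacement, assuming sheet1 is the DataFrame parsed above:

# i-th column and i-th row via iloc (here i = 0)
column = sheet1.iloc[:, 0]
row = sheet1.iloc[0, :]

# both are Series, so they can be looped over directly
for value in column:
    print(value)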


回答 3

我认为这应该可以满足您的需求:

import pandas as pd

# Read the excel sheet to pandas dataframe
df = pd.read_excel(r"PATH\FileName.xlsx", sheetname=0)

I think this should satisfy your need:

import pandas as pd

# Read the excel sheet to pandas dataframe
df = pd.read_excel("PATH\FileName.xlsx", sheetname=0)

回答 4

您只需要将文件的路径提供给 pd.read_excel

import pandas as pd

file_path = "./my_excel.xlsx"
data_frame = pd.read_excel(file_path)

查看文档以了解诸如 skiprows 之类的参数,它可以在加载 Excel 时跳过某些行。

You just need to feed the path to your file to pd.read_excel

import pandas as pd

file_path = "./my_excel.xlsx"
data_frame = pd.read_excel(file_path)

Check out the documentation to explore parameters like skiprows, which skips rows when loading the Excel file.
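
For example, a minimal hedged sketch of skiprows (the file path and the number of rows to skip are placeholders, not from the answer):

import pandas as pd

file_path = "./my_excel.xlsx"
# skip the first two rows of the sheet before parsing
data_frame = pd.read_excel(file_path, skiprows=2)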


回答 5

import pandas as pd

data = pd.read_excel(r'**YourPath**.xlsx')

print(data)

回答 6

这是语法更新的方法,在python代码中更常见。它还可以防止您多次打开同一文件。

import pandas as pd

sheet1, sheet2 = None, None
with pd.ExcelFile("PATH\FileName.xlsx") as reader:
    sheet1 = pd.read_excel(reader, sheet_name='Sheet1')
    sheet2 = pd.read_excel(reader, sheet_name='Sheet2')

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

Here is an updated method with syntax that is more common in python code. It also prevents you from opening the same file multiple times.

import pandas as pd

sheet1, sheet2 = None, None
with pd.ExcelFile("PATH\FileName.xlsx") as reader:
    sheet1 = pd.read_excel(reader, sheet_name='Sheet1')
    sheet2 = pd.read_excel(reader, sheet_name='Sheet2')

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
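
Relatedly (not part of the answer above), passing sheet_name=None reads every sheet in a single pass and returns a dict of DataFrames keyed by sheet name; a minimal sketch reusing the placeholder path:

import pandas as pd

# one open, all sheets parsed into a dict of DataFrames
all_sheets = pd.read_excel(r"PATH\FileName.xlsx", sheet_name=None)
sheet1 = all_sheets["Sheet1"]
sheet2 = all_sheets["Sheet2"]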


熊猫数据框 fillna() 仅就地填充某些列

问题:熊猫数据框 fillna() 仅就地填充某些列

我试图仅对 Pandas 数据框中的某些列子集,将 None 值填充为 0。

当我做:

import pandas as pd
df = pd.DataFrame(data={'a':[1,2,3,None],'b':[4,5,None,6],'c':[None,None,7,8]})
print df
df.fillna(value=0, inplace=True)
print df

输出:

     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  NaN  7.0
3  NaN  6.0  8.0
     a    b    c
0  1.0  4.0  0.0
1  2.0  5.0  0.0
2  3.0  0.0  7.0
3  0.0  6.0  8.0

它将每一个 None 都替换为 0。我想要做的是,只替换列 a 和 b 中的 None,而不替换列 c 中的。

最好的方法是什么?

I am trying to fill None values in a Pandas dataframe with 0’s for only some subset of columns.

When I do:

import pandas as pd
df = pd.DataFrame(data={'a':[1,2,3,None],'b':[4,5,None,6],'c':[None,None,7,8]})
print df
df.fillna(value=0, inplace=True)
print df

The output:

     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  NaN  7.0
3  NaN  6.0  8.0
     a    b    c
0  1.0  4.0  0.0
1  2.0  5.0  0.0
2  3.0  0.0  7.0
3  0.0  6.0  8.0

It replaces every None with 0s. What I want to do is replace the Nones only in columns a and b, but not in c.

What is the best way of doing this?


回答 0

您可以选择所需的列并通过分配来完成:

df[['a', 'b']] = df[['a','b']].fillna(value=0)

结果输出与预期的一样:

     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  0.0  7.0
3  0.0  6.0  8.0

You can select your desired columns and do it by assignment:

df[['a', 'b']] = df[['a','b']].fillna(value=0)

The resulting output is as expected:

     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  0.0  7.0
3  0.0  6.0  8.0
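
If the same fill value is needed for many columns, the same assignment pattern can be driven by a list of names; a minimal sketch (cols_to_fill is a hypothetical list, not from the answer):

cols_to_fill = ['a', 'b']  # hypothetical list of columns to fill
df[cols_to_fill] = df[cols_to_fill].fillna(value=0)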

回答 1

您可以在 fillna 中使用 dict,为不同的列填充不同的值

df.fillna({'a':0,'b':0})
Out[829]: 
     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  0.0  7.0
3  0.0  6.0  8.0

分配回去之后

df=df.fillna({'a':0,'b':0})
df
Out[831]: 
     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  0.0  7.0
3  0.0  6.0  8.0

You can use a dict with fillna to fill different columns with different values

df.fillna({'a':0,'b':0})
Out[829]: 
     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  0.0  7.0
3  0.0  6.0  8.0

Then assign it back:

df=df.fillna({'a':0,'b':0})
df
Out[831]: 
     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  0.0  7.0
3  0.0  6.0  8.0
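
The dict form also allows a different fill value per column; a small sketch (filling 'b' with its own mean is only an illustration, not part of the answer):

# fill column 'a' with 0 and column 'b' with its mean
df = df.fillna({'a': 0, 'b': df['b'].mean()})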

回答 2

您可以使用 Wen 的解决方案并加上 inplace=True,从而避免复制对象:

df.fillna({'a':0, 'b':0}, inplace=True)
print(df)

生成:

     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  0.0  7.0
3  0.0  6.0  8.0

You can avoid making a copy of the object using Wen’s solution and inplace=True:

df.fillna({'a':0, 'b':0}, inplace=True)
print(df)

Which yields:

     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  0.0  7.0
3  0.0  6.0  8.0

回答 3

这是您可以在一行中完成所有操作的方法:

df[['a', 'b']].fillna(value=0, inplace=True)

细分:df[['a', 'b']] 选择要填充 NaN 值的列,value=0 表示用 0 填充 NaN,inplace=True 表示直接在对象上修改而不创建副本。(注意:由于 df[['a','b']] 返回的是一个副本,这种写法可能触发 SettingWithCopyWarning,并且可能不会真正修改原始 DataFrame,参见下一个回答。)

Here’s how you can do it all in one line:

df[['a', 'b']].fillna(value=0, inplace=True)

Breakdown: df[['a', 'b']] selects the columns you want to fill NaN values for, value=0 tells it to fill NaNs with zero, and inplace=True makes the changes without creating a copy of the selection. (Note: because df[['a', 'b']] returns a copy, this can raise a SettingWithCopyWarning and may not actually modify the original DataFrame; see the next answer.)


回答 4

使用最上面的答案会产生有关更改df切片副本的警告。假设您还有其他列,执行此操作的更好方法是传递字典:
df.fillna({'A': 'NA', 'B': 'NA'}, inplace=True)

Using the top answer produces a warning about making changes to a copy of a df slice. Assuming that you have other columns, a better way to do this is to pass a dictionary:
df.fillna({'A': 'NA', 'B': 'NA'}, inplace=True)


回答 5

或类似的东西:

df.loc[df['a'].isnull(),'a']=0
df.loc[df['b'].isnull(),'b']=0

如果还有更多:

for i in your_list:
    df.loc[df[i].isnull(),i]=0

Or something like:

df.loc[df['a'].isnull(),'a']=0
df.loc[df['b'].isnull(),'b']=0

and if there is more:

for i in your_list:
    df.loc[df[i].isnull(),i]=0

回答 6

有时,此语法无法正常工作,因为 fillna() 必须指定要填充的 value(或 method):

df[['col1','col2']] = df[['col1','col2']].fillna()

请改用以下内容:

df[['col1','col2']] = df[['col1','col2']].fillna(value=0)

Sometimes this syntax won’t work, because fillna() needs a value (or method) to fill with:

df[['col1','col2']] = df[['col1','col2']].fillna()

Use the following instead:

df[['col1','col2']] = df[['col1','col2']].fillna(value=0)

如何在Pandas数据框中查找哪些列包含任何NaN值

问题:如何在Pandas数据框中查找哪些列包含任何NaN值

给定一个熊猫数据框,其中可能四处散布着一些 NaN 值:

问题:如何确定哪些列包含NaN值?特别是,可以获取包含NaN的列名称的列表吗?

Given a pandas dataframe containing possible NaN values scattered here and there:

Question: How do I determine which columns contain NaN values? In particular, can I get a list of the column names containing NaNs?


回答 0

更新:使用熊猫0.22.0

较新的 Pandas 版本具有新的方法 DataFrame.isna() 和 DataFrame.notna()

In [71]: df
Out[71]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1

In [72]: df.isna().any()
Out[72]:
a     True
b     True
c    False
dtype: bool

作为列列表:

In [74]: df.columns[df.isna().any()].tolist()
Out[74]: ['a', 'b']

选择这些列(至少包含一个NaN值):

In [73]: df.loc[:, df.isna().any()]
Out[73]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0

旧答案:

尝试使用isnull()

In [97]: df
Out[97]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1

In [98]: pd.isnull(df).sum() > 0
Out[98]:
a     True
b     True
c    False
dtype: bool

或作为@root建议的更清晰的版本:

In [5]: df.isnull().any()
Out[5]:
a     True
b     True
c    False
dtype: bool

In [7]: df.columns[df.isnull().any()].tolist()
Out[7]: ['a', 'b']

选择一个子集-所有列至少包含一个NaN值:

In [31]: df.loc[:, df.isnull().any()]
Out[31]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0

UPDATE: using Pandas 0.22.0

Newer Pandas versions have new methods ‘DataFrame.isna()’ and ‘DataFrame.notna()’

In [71]: df
Out[71]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1

In [72]: df.isna().any()
Out[72]:
a     True
b     True
c    False
dtype: bool

as list of columns:

In [74]: df.columns[df.isna().any()].tolist()
Out[74]: ['a', 'b']

to select those columns (containing at least one NaN value):

In [73]: df.loc[:, df.isna().any()]
Out[73]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0

OLD answer:

Try to use isnull():

In [97]: df
Out[97]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1

In [98]: pd.isnull(df).sum() > 0
Out[98]:
a     True
b     True
c    False
dtype: bool

or, as @root proposed, a clearer version:

In [5]: df.isnull().any()
Out[5]:
a     True
b     True
c    False
dtype: bool

In [7]: df.columns[df.isnull().any()].tolist()
Out[7]: ['a', 'b']

to select a subset – all columns containing at least one NaN value:

In [31]: df.loc[:, df.isnull().any()]
Out[31]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0

回答 1

您可以使用df.isnull().sum()。它显示了所有列以及每个功能的总NaN。

You can use df.isnull().sum(). It shows all columns and the total NaNs of each feature.
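
To narrow that summary down to only the columns that actually contain NaNs, the counts can be filtered with a boolean mask; a minimal sketch assuming df is the DataFrame in question:

null_counts = df.isnull().sum()
# keep only the columns whose NaN count is greater than zero
print(null_counts[null_counts > 0])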


回答 2

我遇到过列太多、无法在屏幕上逐一目视检查的问题,因此可以用下面这个简短的列表推导式来筛选并返回有问题的列:

nan_cols = [i for i in df.columns if df[i].isnull().any()]

如果这对任何人有帮助

I had a problem where I had too many columns to visually inspect on the screen, so here is a short list comprehension that filters and returns the offending columns:

nan_cols = [i for i in df.columns if df[i].isnull().any()]

if that’s helpful to anyone


回答 3

在具有大量列的数据集中,最好查看有多少列包含空值而有多少列不包含空值。

print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))

print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))

print("Total no. of columns in the dataframe")
print(len(df.columns))

例如,在我的数据框中,它包含82列,其中19列至少包含一个空值。

此外,您还可以根据哪些列或行包含更多空值,自动将其删除:

df = df.drop(df.columns[df.isna().sum()>len(df.columns)],axis = 1)
df = df.dropna(axis = 0).reset_index(drop=True)

注意:上面的代码删除了所有空值。如果需要空值,请先处理它们。

In datasets having large number of columns its even better to see how many columns contain null values and how many don’t.

print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))

print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))

print("Total no. of columns in the dataframe")
print(len(df.columns))

For example in my dataframe it contained 82 columns, of which 19 contained at least one null value.

Further, you can also automatically remove columns and rows depending on which has more null values. Here is the code that does this:

df = df.drop(df.columns[df.isna().sum()>len(df.columns)],axis = 1)
df = df.dropna(axis = 0).reset_index(drop=True)

Note: the above code removes all of your null values. If you need to keep them, process them first.


回答 4

我使用以下三行代码来打印出包含至少一个空值的列名:

for column in dataframe:
    if dataframe[column].isnull().any():
       print('{0} has {1} null values'.format(column, dataframe[column].isnull().sum()))

I use these three lines of code to print out the column names that contain at least one null value:

for column in dataframe:
    if dataframe[column].isnull().any():
       print('{0} has {1} null values'.format(column, dataframe[column].isnull().sum()))

回答 5

这两个都应该起作用:

df.isnull().sum()
df.isna().sum()

DataFrame 的方法 isna() 和 isnull() 完全相同。

注意:空字符串 '' 不被视为 NA(isna() 对其返回 False)

Both of these should work:

df.isnull().sum()
df.isna().sum()

The DataFrame methods isna() and isnull() are completely identical.

Note: an empty string '' is not considered NA (isna() returns False for it)
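
A quick illustration of that note; the replace() step is an extra suggestion for also treating empty strings as missing, not part of the answer:

import numpy as np
import pandas as pd

s = pd.Series(['x', '', None])
print(s.isna())                        # '' -> False, None -> True
print(s.replace('', np.nan).isna())    # after replacing '', both count as missing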


回答 6

这对我有用

1.用于获取具有至少1个空值的列。(列名)

data.columns[data.isnull().any()]

2.用于获取至少包含1个空值的列及其空值计数。

data[data.columns[data.isnull().any()]].isnull().sum()

[可选] 3.用于获取空计数的百分比。

data[data.columns[data.isnull().any()]].isnull().sum() * 100 / data.shape[0]

This worked for me,

1. For getting Columns having at least 1 null value. (column names)

data.columns[data.isnull().any()]

2. For getting those columns along with the count of null values in each.

data[data.columns[data.isnull().any()]].isnull().sum()

[Optional] 3. For getting percentage of the null count.

data[data.columns[data.isnull().any()]].isnull().sum() * 100 / data.shape[0]

修改熊猫数据框中的行的子集

问题:修改熊猫数据框中的行的子集

假设我有一个带有两列A和B的pandas DataFrame。我想修改此DataFrame(或创建一个副本),以便每当A为0时B始终为NaN。我将如何实现?

我尝试了以下

df['A'==0]['B'] = np.nan

df['A'==0]['B'].values.fill(np.nan)

没有成功。

Assume I have a pandas DataFrame with two columns, A and B. I’d like to modify this DataFrame (or create a copy) so that B is always NaN whenever A is 0. How would I achieve that?

I tried the following

df['A'==0]['B'] = np.nan

and

df['A'==0]['B'].values.fill(np.nan)

without success.


回答 0

使用.loc基于标签索引:

df.loc[df.A==0, 'B'] = np.nan

df.A==0表达式创建一个布尔系列,该系列对行进行索引,然后'B'选择列。您还可以使用它来转换列的子集,例如:

df.loc[df.A==0, 'B'] = df.loc[df.A==0, 'B'] / 2

我对 pandas 的内部机制了解得不够多,无法确切说明它为什么有效,但基本问题在于:对 DataFrame 进行索引时,有时返回的是结果的副本,有时返回的是原始对象上的视图。根据这里的文档,这种行为取决于底层的 numpy 行为。我发现,在一次操作中完成所有访问(而不是分成两步链式索引)更有可能让赋值生效。

Use .loc for label based indexing:

df.loc[df.A==0, 'B'] = np.nan

The df.A==0 expression creates a boolean series that indexes the rows, 'B' selects the column. You can also use this to transform a subset of a column, e.g.:

df.loc[df.A==0, 'B'] = df.loc[df.A==0, 'B'] / 2

I don’t know enough about pandas internals to know exactly why that works, but the basic issue is that sometimes indexing into a DataFrame returns a copy of the result, and sometimes it returns a view on the original object. According to the documentation here, this behavior depends on the underlying numpy behavior. I’ve found that accessing everything in one operation (rather than in two chained [one][two] steps) is more likely to work for setting.
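
A small sketch of the difference being described, using the question’s columns with made-up data; the chained form may assign into a temporary copy, while the single .loc call writes to df itself:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0], 'B': [2, 0, 5]})

# chained indexing: df[df.A == 0] may be a copy, so this assignment can be silently lost
df[df.A == 0]['B'] = np.nan

# one .loc call with the row mask and the column label modifies df itself
df.loc[df.A == 0, 'B'] = np.nan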


回答 1

这是有关高级索引的熊猫文档:

这一节正好说明了您的需求!事实证明,df.loc(.ix 已被弃用,正如下面许多人指出的那样)可以用来对数据帧进行灵活的切片/切块,也可以用来赋值。

df.loc[selection criteria, columns I want] = value

因此,Bren 的回答的意思是:“找到所有 df.A == 0 的位置,选择列 B 并将其设置为 np.nan”。

Here is the relevant part of the pandas docs on advanced indexing:

That section explains exactly what you need! It turns out df.loc (.ix has been deprecated, as many have pointed out below) can be used for cool slicing/dicing of a dataframe, and it can also be used to set things.

df.loc[selection criteria, columns I want] = value

So Bren’s answer is saying ‘find me all the places where df.A == 0, select column B and set it to np.nan


回答 2

从熊猫0.20开始不推荐使用ix。正确的方法是使用df.loc

这是一个有效的例子

>>> import pandas as pd 
>>> import numpy as np 
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A   B
0  0 NaN
1  1   0
2  0 NaN
>>> 

说明:

正如这里的文档所解释的,.loc 主要是基于标签的,但也可以与布尔数组一起使用。

因此,我们在上面所做的,是按以下方式应用 df.loc[row_index, column_index]:

  • 利用 loc 可以接受布尔数组作为掩码这一事实,该掩码在 row_index 处告诉 pandas 我们要更改哪些行
  • 利用 loc 也是基于标签的这一事实,在 column_index 处使用标签 'B' 来选择列

我们可以使用逻辑运算、条件判断或任何返回布尔序列的操作来构造布尔数组。在上面的示例中,我们想要列 A 为 0 的那些行,为此可以使用 df.A == 0;正如下面的示例所示,它会返回一个布尔序列。

>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df 
   A  B
0  0  2
1  1  0
2  0  5
>>> df.A == 0 
0     True
1    False
2     True
Name: A, dtype: bool
>>> 

然后,我们使用上面的布尔数组选择和修改必要的行:

>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A   B
0  0 NaN
1  1   0
2  0 NaN

有关更多信息,请在此处查看高级索引文档。

Starting from pandas 0.20 ix is deprecated. The right way is to use df.loc

here is a working example

>>> import pandas as pd 
>>> import numpy as np 
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A   B
0  0 NaN
1  1   0
2  0 NaN
>>> 

Explanation:

As explained in the doc here, .loc is primarily label based, but may also be used with a boolean array.

So, what we are doing above is applying df.loc[row_index, column_index] by:

  • Exploiting the fact that loc can take a boolean array as a mask that tells pandas which subset of rows we want to change in row_index
  • Exploiting the fact loc is also label based to select the column using the label 'B' in the column_index

We can use logical operators, conditions, or any operation that returns a series of booleans to construct the array of booleans. In the above example, we want the rows where column A contains a 0; for that we can use df.A == 0, which, as you can see in the example below, returns a series of booleans.

>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df 
   A  B
0  0  2
1  1  0
2  0  5
>>> df.A == 0 
0     True
1    False
2     True
Name: A, dtype: bool
>>> 

Then, we use the above array of booleans to select and modify the necessary rows:

>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A   B
0  0 NaN
1  1   0
2  0 NaN

For more information check the advanced indexing documentation here.


回答 3

要大幅提高速度,请使用NumPy的where函数。

建立

创建一个两列DataFrame,其中包含100,000行,其中一些零。

df = pd.DataFrame(np.random.randint(0,3, (100000,2)), columns=list('ab'))

快速解决方案 numpy.where

df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)

时机

%timeit df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
685 µs ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.loc[df['a'] == 0, 'b'] = np.nan
3.11 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Numpy的where速度快约4倍

For a massive speed increase, use NumPy’s where function.

Setup

Create a two-column DataFrame with 100,000 rows with some zeros.

df = pd.DataFrame(np.random.randint(0,3, (100000,2)), columns=list('ab'))

Fast solution with numpy.where

df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)

Timings

%timeit df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
685 µs ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.loc[df['a'] == 0, 'b'] = np.nan
3.11 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Numpy’s where is about 4x faster
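
As a hedged alternative that stays inside pandas, Series.mask expresses the same conditional replacement; it is not from the answer, and whether it beats .loc or np.where will depend on the pandas version and the data:

# NaN wherever the condition is True (NaN is mask's default replacement)
df['b'] = df['b'].mask(df['a'] == 0)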


回答 4

要替换多个列,请使用 .values 将其转换为 numpy 数组:

df.loc[df.A==0, ['B', 'C']] = df.loc[df.A==0, ['B', 'C']].values / 2

To replace multiples columns convert to numpy array using .values:

df.loc[df.A==0, ['B', 'C']] = df.loc[df.A==0, ['B', 'C']].values / 2
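
As a hedged aside: because the .loc selection on both sides picks the same rows and columns, the indexes align, so on recent pandas versions the result can usually be assigned back without the .values conversion; the conversion mainly guards against alignment or dtype surprises:

df.loc[df.A==0, ['B', 'C']] = df.loc[df.A==0, ['B', 'C']] / 2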