问题:如何基于列值从DataFrame中选择行?

如何DataFrame基于Python Pandas中某些列的值从中选择行?

在SQL中,我将使用:

SELECT *
FROM table
WHERE colume_name = some_value

我试图查看熊猫文档,但没有立即找到答案。

How to select rows from a DataFrame based on values in some column in Python Pandas?

In SQL, I would use:

SELECT *
FROM table
WHERE colume_name = some_value

I tried to look at pandas documentation but did not immediately find the answer.


回答 0

要选择列值等于标量的行some_value,请使用==

df.loc[df['column_name'] == some_value]

要选择列值处于可迭代状态的行some_values,请使用isin

df.loc[df['column_name'].isin(some_values)]

将多个条件与&

df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

注意括号。由于Python的运算符优先级规则&绑定比<=和更紧密>=。因此,最后一个示例中的括号是必需的。没有括号

df['column_name'] >= A & df['column_name'] <= B

被解析为

df['column_name'] >= (A & df['column_name']) <= B

这导致一个系列真值是模棱两可的错误


要选择列值不相等的行 some_value,请使用!=

df.loc[df['column_name'] != some_value]

isin返回一个布尔系列,因此要选择值不在 in的行,请some_values使用~以下命令对布尔系列求反:

df.loc[~df['column_name'].isin(some_values)]

例如,

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
#      A      B  C   D
# 0  foo    one  0   0
# 1  bar    one  1   2
# 2  foo    two  2   4
# 3  bar  three  3   6
# 4  foo    two  4   8
# 5  bar    two  5  10
# 6  foo    one  6  12
# 7  foo  three  7  14

print(df.loc[df['A'] == 'foo'])

Yield

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

如果要包含多个值,请将它们放在列表中(或更普遍地说,是任何可迭代的值)并使用isin

print(df.loc[df['B'].isin(['one','three'])])

Yield

     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14

但是请注意,如果您希望多次执行此操作,则先创建索引然后再使用会更有效df.loc

df = df.set_index(['B'])
print(df.loc['one'])

Yield

       A  C   D
B              
one  foo  0   0
one  bar  1   2
one  foo  6  12

或者,要包含来自索引的多个值,请使用df.index.isin

df.loc[df.index.isin(['one','two'])]

Yield

       A  C   D
B              
one  foo  0   0
one  bar  1   2
two  foo  2   4
two  foo  4   8
two  bar  5  10
one  foo  6  12

To select rows whose column value equals a scalar, some_value, use ==:

df.loc[df['column_name'] == some_value]

To select rows whose column value is in an iterable, some_values, use isin:

df.loc[df['column_name'].isin(some_values)]

Combine multiple conditions with &:

df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

Note the parentheses. Due to Python’s operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses

df['column_name'] >= A & df['column_name'] <= B

is parsed as

df['column_name'] >= (A & df['column_name']) <= B

which results in a Truth value of a Series is ambiguous error.


To select rows whose column value does not equal some_value, use !=:

df.loc[df['column_name'] != some_value]

isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:

df.loc[~df['column_name'].isin(some_values)]

For example,

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
#      A      B  C   D
# 0  foo    one  0   0
# 1  bar    one  1   2
# 2  foo    two  2   4
# 3  bar  three  3   6
# 4  foo    two  4   8
# 5  bar    two  5  10
# 6  foo    one  6  12
# 7  foo  three  7  14

print(df.loc[df['A'] == 'foo'])

yields

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

If you have multiple values you want to include, put them in a list (or more generally, any iterable) and use isin:

print(df.loc[df['B'].isin(['one','three'])])

yields

     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14

Note, however, that if you wish to do this many times, it is more efficient to make an index first, and then use df.loc:

df = df.set_index(['B'])
print(df.loc['one'])

yields

       A  C   D
B              
one  foo  0   0
one  bar  1   2
one  foo  6  12

or, to include multiple values from the index use df.index.isin:

df.loc[df.index.isin(['one','two'])]

yields

       A  C   D
B              
one  foo  0   0
one  bar  1   2
two  foo  2   4
two  foo  4   8
two  bar  5  10
one  foo  6  12

回答 1

有几种方法可以从熊猫数据框中选择行:

  1. 布尔索引(df[df['col'] == value]
  2. 位置索引(df.iloc[...]
  3. 标签索引(df.xs(...)
  4. df.query(...) API

下面,我为您展示每种方法的示例,并提供何时使用某些技术的建议。假设我们的标准是列'A'=='foo'

(有关性能的说明:对于每种基本类型,我们可以使用pandas API简化事情,也可以冒险使用API​​之外的numpy东西,通常使用,并加快速度。)


设置
我们首先需要确定一个条件,该条件将作为选择行的标准。我们将从OP的案例开始column_name == some_value,并包括其他一些常见的使用案例。

从@unutbu借来的:

import pandas as pd, numpy as np

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

1.布尔索引

…布尔索引需要找到'A'等于的每一行列的真实值'foo',然后使用这些真实值来标识要保留的行。通常,我们将这个系列命名为真值数组mask。我们也会在这里这样做。

mask = df['A'] == 'foo'

然后,我们可以使用此掩码对数据帧进行切片或索引

df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

这是完成此任务的最简单方法之一,如果性能或直观性不成问题,则应选择此方法。但是,如果需要考虑性能,那么您可能需要考虑另一种创建的方法mask


2.位置索引

位置索引(df.iloc[...])有其用例,但这不是其中一种。为了确定在哪里切片,我们首先需要执行与上面相同的布尔分析。这使我们执行一个额外的步骤来完成相同的任务。

mask = df['A'] == 'foo'
pos = np.flatnonzero(mask)
df.iloc[pos]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

3.标签索引

标签索引可以非常方便,但是在这种情况下,我们将再次做更多的工作而没有任何好处

df.set_index('A', append=True, drop=False).xs('foo', level=1)

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

4. df.query()API

pd.DataFrame.query是执行此任务的一种非常优雅/直观的方法,但通常速度较慢。但是,如果您注意以下时间安排,对于大数据,查询将非常有效。比标准方法更多,其幅度与我的最佳建议相似。

df.query('A == "foo"')

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

我的偏好是使用 Boolean mask

可以通过修改创建方式来进行实际改进Boolean mask

mask替代方案1
使用基础numpy数组,放弃创建另一个数组的开销pd.Series

mask = df['A'].values == 'foo'

最后,我将显示更完整的时间测试,但请看一下使用示例数据帧所获得的性能提升。首先,我们来看一下创建mask

%timeit mask = df['A'].values == 'foo'
%timeit mask = df['A'] == 'foo'

5.84 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
166 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

mask使用numpy数组评估大约快30倍。部分原因是numpy评估速度通常更快。这也部分是由于缺少建立索引和相应pd.Series对象所需的开销。

接下来,我们将看一下一个切片mask相对另一个切片的时间。

mask = df['A'].values == 'foo'
%timeit df[mask]
mask = df['A'] == 'foo'
%timeit df[mask]

219 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
239 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

性能提升并不明显。我们将看看这是否可以阻止更强大的测试。



我们也可以重建数据帧。重建数据帧时有一个很大的警告- dtypes这样做时必须注意!

而不是df[mask]我们会这样做

pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

如果数据帧是混合类型(在我们的示例中是),那么当得到df.values的数组dtype object为时,新数据帧的所有列将为dtype object。因此要求astype(df.dtypes)并杀死任何潜在的性能提升。

%timeit df[m]
%timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

216 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.43 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

但是,如果数据帧不是混合类型,这是一种非常有用的方法。

给定

np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))

d1

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
6  8  7  6  4  7
7  6  2  6  6  5
8  2  8  7  5  8
9  4  7  6  1  5    

%%timeit
mask = d1['A'].values == 7
d1[mask]

179 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
mask = d1['A'].values == 7
pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)

87 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

我们把时间缩短了一半。



@unutbu还向我们展示了如何使用一组值pd.Series.isin来说明每个元素df['A']。如果我们的一组值是一组一个值,即,这将得出相同的结果'foo'。但是,如果需要,它也可以概括为包含更大的值集。事实证明,即使这是一个更通用的解决方案,它仍然相当快。对于不熟悉该概念的人来说,唯一的真正损失就是直观性。

mask = df['A'].isin(['foo'])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

但是,和以前一样,我们可以利用它numpy来提高性能,同时几乎不牺牲任何内容。我们将使用np.in1d

mask = np.in1d(df['A'].values, ['foo'])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

时间安排
我将包括其他帖子中提到的其他概念,以供参考。
下面的代码

该表中的每个列代表一个不同长度的数据帧,我们在该数据帧上测试每个功能。每列均显示相对时间,其中最快的功能的基本索引为1.0

res.div(res.min())

                         10        30        100       300       1000      3000      10000     30000
mask_standard         2.156872  1.850663  2.034149  2.166312  2.164541  3.090372  2.981326  3.131151
mask_standard_loc     1.879035  1.782366  1.988823  2.338112  2.361391  3.036131  2.998112  2.990103
mask_with_values      1.010166  1.000000  1.005113  1.026363  1.028698  1.293741  1.007824  1.016919
mask_with_values_loc  1.196843  1.300228  1.000000  1.000000  1.038989  1.219233  1.037020  1.000000
query                 4.997304  4.765554  5.934096  4.500559  2.997924  2.397013  1.680447  1.398190
xs_label              4.124597  4.272363  5.596152  4.295331  4.676591  5.710680  6.032809  8.950255
mask_with_isin        1.674055  1.679935  1.847972  1.724183  1.345111  1.405231  1.253554  1.264760
mask_with_in1d        1.000000  1.083807  1.220493  1.101929  1.000000  1.000000  1.000000  1.144175

您会注意到,最快的时间似乎在mask_with_values和之间共享mask_with_in1d

res.T.plot(loglog=True)

在此处输入图片说明

功能

def mask_standard(df):
    mask = df['A'] == 'foo'
    return df[mask]

def mask_standard_loc(df):
    mask = df['A'] == 'foo'
    return df.loc[mask]

def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]

def mask_with_values_loc(df):
    mask = df['A'].values == 'foo'
    return df.loc[mask]

def query(df):
    return df.query('A == "foo"')

def xs_label(df):
    return df.set_index('A', append=True, drop=False).xs('foo', level=-1)

def mask_with_isin(df):
    mask = df['A'].isin(['foo'])
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]

测试中

res = pd.DataFrame(
    index=[
        'mask_standard', 'mask_standard_loc', 'mask_with_values', 'mask_with_values_loc',
        'query', 'xs_label', 'mask_with_isin', 'mask_with_in1d'
    ],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

for j in res.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in res.index:a
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        res.at[i, j] = timeit(stmt, setp, number=50)

特殊时序
查看dtype整个数据帧中只有一个非对象的特殊情况。 下面的代码

spec.div(spec.min())

                     10        30        100       300       1000      3000      10000     30000
mask_with_values  1.009030  1.000000  1.194276  1.000000  1.236892  1.095343  1.000000  1.000000
mask_with_in1d    1.104638  1.094524  1.156930  1.072094  1.000000  1.000000  1.040043  1.027100
reconstruct       1.000000  1.142838  1.000000  1.355440  1.650270  2.222181  2.294913  3.406735

事实证明,重建数百行不值得。

spec.T.plot(loglog=True)

在此处输入图片说明

功能

np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))

def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]

def reconstruct(df):
    v = df.values
    mask = np.in1d(df['A'].values, ['foo'])
    return pd.DataFrame(v[mask], df.index[mask], df.columns)

spec = pd.DataFrame(
    index=['mask_with_values', 'mask_with_in1d', 'reconstruct'],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

测试中

for j in spec.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in spec.index:
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        spec.at[i, j] = timeit(stmt, setp, number=50)

There are several ways to select rows from a pandas data frame:

  1. Boolean indexing (df[df['col'] == value] )
  2. Positional indexing (df.iloc[...])
  3. Label indexing (df.xs(...))
  4. df.query(...) API

Below I show you examples of each, with advice when to use certain techniques. Assume our criterion is column 'A' == 'foo'

(Note on performance: For each base type, we can keep things simple by using the pandas API or we can venture outside the API, usually into numpy, and speed things up.)


Setup
The first thing we’ll need is to identify a condition that will act as our criterion for selecting rows. We’ll start with the OP’s case column_name == some_value, and include some other common use cases.

Borrowing from @unutbu:

import pandas as pd, numpy as np

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

1. Boolean indexing

… Boolean indexing requires finding the true value of each row’s 'A' column being equal to 'foo', then using those truth values to identify which rows to keep. Typically, we’d name this series, an array of truth values, mask. We’ll do so here as well.

mask = df['A'] == 'foo'

We can then use this mask to slice or index the data frame

df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

This is one of the simplest ways to accomplish this task and if performance or intuitiveness isn’t an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the mask.


2. Positional indexing

Positional indexing (df.iloc[...]) has its use cases, but this isn’t one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.

mask = df['A'] == 'foo'
pos = np.flatnonzero(mask)
df.iloc[pos]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

3. Label indexing

Label indexing can be very handy, but in this case, we are again doing more work for no benefit

df.set_index('A', append=True, drop=False).xs('foo', level=1)

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

4. df.query() API

pd.DataFrame.query is a very elegant/intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data, the query is very efficient. More so than the standard approach and of similar magnitude as my best suggestion.

df.query('A == "foo"')

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

My preference is to use the Boolean mask

Actual improvements can be made by modifying how we create our Boolean mask.

mask alternative 1
Use the underlying numpy array and forgo the overhead of creating another pd.Series

mask = df['A'].values == 'foo'

I’ll show more complete time tests at the end, but just take a look at the performance gains we get using the sample data frame. First, we look at the difference in creating the mask

%timeit mask = df['A'].values == 'foo'
%timeit mask = df['A'] == 'foo'

5.84 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
166 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Evaluating the mask with the numpy array is ~ 30 times faster. This is partly due to numpy evaluation often being faster. It is also partly due to the lack of overhead necessary to build an index and a corresponding pd.Series object.

Next, we’ll look at the timing for slicing with one mask versus the other.

mask = df['A'].values == 'foo'
%timeit df[mask]
mask = df['A'] == 'foo'
%timeit df[mask]

219 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
239 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The performance gains aren’t as pronounced. We’ll see if this holds up over more robust testing.



We could have reconstructed the data frame as well. There is a big caveat when reconstructing a dataframe—you must take care of the dtypes when doing so!

Instead of df[mask] we will do this

pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

If the data frame is of mixed type, which our example is, then when we get df.values the resulting array is of dtype object and consequently, all columns of the new data frame will be of dtype object. Thus requiring the astype(df.dtypes) and killing any potential performance gains.

%timeit df[m]
%timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

216 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.43 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

However, if the data frame is not of mixed type, this is a very useful way to do it.

Given

np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))

d1

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
6  8  7  6  4  7
7  6  2  6  6  5
8  2  8  7  5  8
9  4  7  6  1  5    

%%timeit
mask = d1['A'].values == 7
d1[mask]

179 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Versus

%%timeit
mask = d1['A'].values == 7
pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)

87 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We cut the time in half.



@unutbu also shows us how to use pd.Series.isin to account for each element of df['A'] being in a set of values. This evaluates to the same thing if our set of values is a set of one value, namely 'foo'. But it also generalizes to include larger sets of values if needed. Turns out, this is still pretty fast even though it is a more general solution. The only real loss is in intuitiveness for those not familiar with the concept.

mask = df['A'].isin(['foo'])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

However, as before, we can utilize numpy to improve performance while sacrificing virtually nothing. We’ll use np.in1d

mask = np.in1d(df['A'].values, ['foo'])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

Timing
I’ll include other concepts mentioned in other posts as well for reference.
Code Below

Each Column in this table represents a different length data frame over which we test each function. Each column shows relative time taken, with the fastest function given a base index of 1.0.

res.div(res.min())

                         10        30        100       300       1000      3000      10000     30000
mask_standard         2.156872  1.850663  2.034149  2.166312  2.164541  3.090372  2.981326  3.131151
mask_standard_loc     1.879035  1.782366  1.988823  2.338112  2.361391  3.036131  2.998112  2.990103
mask_with_values      1.010166  1.000000  1.005113  1.026363  1.028698  1.293741  1.007824  1.016919
mask_with_values_loc  1.196843  1.300228  1.000000  1.000000  1.038989  1.219233  1.037020  1.000000
query                 4.997304  4.765554  5.934096  4.500559  2.997924  2.397013  1.680447  1.398190
xs_label              4.124597  4.272363  5.596152  4.295331  4.676591  5.710680  6.032809  8.950255
mask_with_isin        1.674055  1.679935  1.847972  1.724183  1.345111  1.405231  1.253554  1.264760
mask_with_in1d        1.000000  1.083807  1.220493  1.101929  1.000000  1.000000  1.000000  1.144175

You’ll notice that fastest times seem to be shared between mask_with_values and mask_with_in1d

res.T.plot(loglog=True)

enter image description here

Functions

def mask_standard(df):
    mask = df['A'] == 'foo'
    return df[mask]

def mask_standard_loc(df):
    mask = df['A'] == 'foo'
    return df.loc[mask]

def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]

def mask_with_values_loc(df):
    mask = df['A'].values == 'foo'
    return df.loc[mask]

def query(df):
    return df.query('A == "foo"')

def xs_label(df):
    return df.set_index('A', append=True, drop=False).xs('foo', level=-1)

def mask_with_isin(df):
    mask = df['A'].isin(['foo'])
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]

Testing

res = pd.DataFrame(
    index=[
        'mask_standard', 'mask_standard_loc', 'mask_with_values', 'mask_with_values_loc',
        'query', 'xs_label', 'mask_with_isin', 'mask_with_in1d'
    ],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

for j in res.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in res.index:a
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        res.at[i, j] = timeit(stmt, setp, number=50)

Special Timing
Looking at the special case when we have a single non-object dtype for the entire data frame. Code Below

spec.div(spec.min())

                     10        30        100       300       1000      3000      10000     30000
mask_with_values  1.009030  1.000000  1.194276  1.000000  1.236892  1.095343  1.000000  1.000000
mask_with_in1d    1.104638  1.094524  1.156930  1.072094  1.000000  1.000000  1.040043  1.027100
reconstruct       1.000000  1.142838  1.000000  1.355440  1.650270  2.222181  2.294913  3.406735

Turns out, reconstruction isn’t worth it past a few hundred rows.

spec.T.plot(loglog=True)

enter image description here

Functions

np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))

def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]

def reconstruct(df):
    v = df.values
    mask = np.in1d(df['A'].values, ['foo'])
    return pd.DataFrame(v[mask], df.index[mask], df.columns)

spec = pd.DataFrame(
    index=['mask_with_values', 'mask_with_in1d', 'reconstruct'],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

Testing

for j in spec.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in spec.index:
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        spec.at[i, j] = timeit(stmt, setp, number=50)

回答 2

tl; dr

大熊猫相当于

select * from table where column_name = some_value

table[table.column_name == some_value]

多个条件:

table[(table.column_name == some_value) | (table.column_name2 == some_value2)]

要么

table.query('column_name == some_value | column_name2 == some_value2')

代码示例

import pandas as pd

# Create data set
d = {'foo':[100, 111, 222], 
     'bar':[333, 444, 555]}
df = pd.DataFrame(d)

# Full dataframe:
df

# Shows:
#    bar   foo 
# 0  333   100
# 1  444   111
# 2  555   222

# Output only the row(s) in df where foo is 222:
df[df.foo == 222]

# Shows:
#    bar  foo
# 2  555  222

在上面的代码中df[df.foo == 222]222在这种情况下,是根据行值给出行的行。

多种条件也是可能的:

df[(df.foo == 222) | (df.bar == 444)]
#    bar  foo
# 1  444  111
# 2  555  222

但是在那一点上,我建议使用查询函数,因为它不太冗长,并且产生的结果相同:

df.query('foo == 222 | bar == 444')

tl;dr

The pandas equivalent to

select * from table where column_name = some_value

is

table[table.column_name == some_value]

Multiple conditions:

table[(table.column_name == some_value) | (table.column_name2 == some_value2)]

or

table.query('column_name == some_value | column_name2 == some_value2')

Code example

import pandas as pd

# Create data set
d = {'foo':[100, 111, 222], 
     'bar':[333, 444, 555]}
df = pd.DataFrame(d)

# Full dataframe:
df

# Shows:
#    bar   foo 
# 0  333   100
# 1  444   111
# 2  555   222

# Output only the row(s) in df where foo is 222:
df[df.foo == 222]

# Shows:
#    bar  foo
# 2  555  222

In the above code it is the line df[df.foo == 222] that gives the rows based on the column value, 222 in this case.

Multiple conditions are also possible:

df[(df.foo == 222) | (df.bar == 444)]
#    bar  foo
# 1  444  111
# 2  555  222

But at that point I would recommend using the query function, since it’s less verbose and yields the same result:

df.query('foo == 222 | bar == 444')

回答 3

我发现先前答案的语法是多余的,很难记住。Pandas query()在v0.13中引入了该方法,我更喜欢它。对于你的问题,你可以做df.query('col == val')

转载自http://pandas.pydata.org/pandas-docs/version/0.17.0/indexing.html#indexing-query

In [167]: n = 10

In [168]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [169]: df
Out[169]: 
          a         b         c
0  0.687704  0.582314  0.281645
1  0.250846  0.610021  0.420121
2  0.624328  0.401816  0.932146
3  0.011763  0.022921  0.244186
4  0.590198  0.325680  0.890392
5  0.598892  0.296424  0.007312
6  0.634625  0.803069  0.123872
7  0.924168  0.325076  0.303746
8  0.116822  0.364564  0.454607
9  0.986142  0.751953  0.561512

# pure python
In [170]: df[(df.a < df.b) & (df.b < df.c)]
Out[170]: 
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607

# query
In [171]: df.query('(a < b) & (b < c)')
Out[171]: 
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607

您还可以在环境中添加一个来访问变量@

exclude = ('red', 'orange')
df.query('color not in @exclude')

I find the syntax of the previous answers to be redundant and difficult to remember. Pandas introduced the query() method in v0.13 and I much prefer it. For your question, you could do df.query('col == val')

Reproduced from http://pandas.pydata.org/pandas-docs/version/0.17.0/indexing.html#indexing-query

In [167]: n = 10

In [168]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [169]: df
Out[169]: 
          a         b         c
0  0.687704  0.582314  0.281645
1  0.250846  0.610021  0.420121
2  0.624328  0.401816  0.932146
3  0.011763  0.022921  0.244186
4  0.590198  0.325680  0.890392
5  0.598892  0.296424  0.007312
6  0.634625  0.803069  0.123872
7  0.924168  0.325076  0.303746
8  0.116822  0.364564  0.454607
9  0.986142  0.751953  0.561512

# pure python
In [170]: df[(df.a < df.b) & (df.b < df.c)]
Out[170]: 
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607

# query
In [171]: df.query('(a < b) & (b < c)')
Out[171]: 
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607

You can also access variables in the environment by prepending an @.

exclude = ('red', 'orange')
df.query('color not in @exclude')

回答 4

使用时.query具有更大的灵活性pandas >= 0.25.0

2019年8月更新的答案

因为pandas >= 0.25.0我们可以使用querypandas方法甚至带有空格的列名来使用该方法过滤数据帧。通常,列名中的空格会产生错误,但是现在我们可以使用反引号(`)来解决该问题,请参见GitHub

# Example dataframe
df = pd.DataFrame({'Sender email':['ex@example.com', "reply@shop.com", "buy@shop.com"]})

     Sender email
0  ex@example.com
1  reply@shop.com
2    buy@shop.com

使用.querywith方法str.endswith

df.query('`Sender email`.str.endswith("@shop.com")')

输出量

     Sender email
1  reply@shop.com
2    buy@shop.com

我们也可以@在查询中以前缀作为局部变量来使用局部变量:

domain = 'shop.com'
df.query('`Sender email`.str.endswith(@domain)')

输出量

     Sender email
1  reply@shop.com
2    buy@shop.com

More flexibility using .query with pandas >= 0.25.0:

August 2019 updated answer

Since pandas >= 0.25.0 we can use the query method to filter dataframes with pandas methods and even column names which have spaces. Normally the spaces in column names would give an error, but now we can solve that using a backtick (`) see GitHub:

# Example dataframe
df = pd.DataFrame({'Sender email':['ex@example.com', "reply@shop.com", "buy@shop.com"]})

     Sender email
0  ex@example.com
1  reply@shop.com
2    buy@shop.com

Using .query with method str.endswith:

df.query('`Sender email`.str.endswith("@shop.com")')

Output

     Sender email
1  reply@shop.com
2    buy@shop.com

Also we can use local variables by prefixing it with an @ in our query:

domain = 'shop.com'
df.query('`Sender email`.str.endswith(@domain)')

Output

     Sender email
1  reply@shop.com
2    buy@shop.com

回答 5

使用numpy.where可以获得更快的结果。

例如,使用unubtu的设置

In [76]: df.iloc[np.where(df.A.values=='foo')]
Out[76]: 
     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

时序比较:

In [68]: %timeit df.iloc[np.where(df.A.values=='foo')]  # fastest
1000 loops, best of 3: 380 µs per loop

In [69]: %timeit df.loc[df['A'] == 'foo']
1000 loops, best of 3: 745 µs per loop

In [71]: %timeit df.loc[df['A'].isin(['foo'])]
1000 loops, best of 3: 562 µs per loop

In [72]: %timeit df[df.A=='foo']
1000 loops, best of 3: 796 µs per loop

In [74]: %timeit df.query('(A=="foo")')  # slowest
1000 loops, best of 3: 1.71 ms per loop

Faster results can be achieved using numpy.where.

For example, with unubtu’s setup

In [76]: df.iloc[np.where(df.A.values=='foo')]
Out[76]: 
     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

Timing comparisons:

In [68]: %timeit df.iloc[np.where(df.A.values=='foo')]  # fastest
1000 loops, best of 3: 380 µs per loop

In [69]: %timeit df.loc[df['A'] == 'foo']
1000 loops, best of 3: 745 µs per loop

In [71]: %timeit df.loc[df['A'].isin(['foo'])]
1000 loops, best of 3: 562 µs per loop

In [72]: %timeit df[df.A=='foo']
1000 loops, best of 3: 796 µs per loop

In [74]: %timeit df.query('(A=="foo")')  # slowest
1000 loops, best of 3: 1.71 ms per loop

回答 6

这是一个简单的例子

from pandas import DataFrame

# Create data set
d = {'Revenue':[100,111,222], 
     'Cost':[333,444,555]}
df = DataFrame(d)


# mask = Return True when the value in column "Revenue" is equal to 111
mask = df['Revenue'] == 111

print mask

# Result:
# 0    False
# 1     True
# 2    False
# Name: Revenue, dtype: bool


# Select * FROM df WHERE Revenue = 111
df[mask]

# Result:
#    Cost    Revenue
# 1  444     111

Here is a simple example

from pandas import DataFrame

# Create data set
d = {'Revenue':[100,111,222], 
     'Cost':[333,444,555]}
df = DataFrame(d)


# mask = Return True when the value in column "Revenue" is equal to 111
mask = df['Revenue'] == 111

print mask

# Result:
# 0    False
# 1     True
# 2    False
# Name: Revenue, dtype: bool


# Select * FROM df WHERE Revenue = 111
df[mask]

# Result:
#    Cost    Revenue
# 1  444     111

回答 7

从熊猫的给定值中,仅从多个列中选择特定的列:

select col_name1, col_name2 from table where column_name = some_value.

选项:

df.loc[df['column_name'] == some_value][[col_name1, col_name2]]

要么

df.query['column_name' == 'some_value'][[col_name1, col_name2]]

For selecting only specific columns out of multiple columns for a given value in pandas:

select col_name1, col_name2 from table where column_name = some_value.

Options:

df.loc[df['column_name'] == some_value][[col_name1, col_name2]]

or

df.query['column_name' == 'some_value'][[col_name1, col_name2]]

回答 8

附加到这个著名的问题(虽然为时已晚):您还可以df.groupby('column_name').get_group('column_desired_value').reset_index()使用指定列具有特定值的方法来制作新的数据框。例如

import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split()})
print("Original dataframe:")
print(df)

b_is_two_dataframe = pd.DataFrame(df.groupby('B').get_group('two').reset_index()).drop('index', axis = 1) 
#NOTE: the final drop is to remove the extra index column returned by groupby object
print('Sub dataframe where B is two:')
print(b_is_two_dataframe)

运行此给出:

Original dataframe:
     A      B
0  foo    one
1  bar    one
2  foo    two
3  bar  three
4  foo    two
5  bar    two
6  foo    one
7  foo  three
Sub dataframe where B is two:
     A    B
0  foo  two
1  foo  two
2  bar  two

To append to this famous question (though a bit too late): You can also do df.groupby('column_name').get_group('column_desired_value').reset_index() to make a new data frame with specified column having a particular value. E.g.

import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split()})
print("Original dataframe:")
print(df)

b_is_two_dataframe = pd.DataFrame(df.groupby('B').get_group('two').reset_index()).drop('index', axis = 1) 
#NOTE: the final drop is to remove the extra index column returned by groupby object
print('Sub dataframe where B is two:')
print(b_is_two_dataframe)

Run this gives:

Original dataframe:
     A      B
0  foo    one
1  bar    one
2  foo    two
3  bar  three
4  foo    two
5  bar    two
6  foo    one
7  foo  three
Sub dataframe where B is two:
     A    B
0  foo  two
1  foo  two
2  bar  two

回答 9

您也可以使用.apply:

df.apply(lambda row: row[df['B'].isin(['one','three'])])

它实际上是逐行工作的(即,将函数应用于每一行)。

输出是

   A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14

结果与@unutbu提到的使用相同

df[[df['B'].isin(['one','three'])]]

You can also use .apply:

df.apply(lambda row: row[df['B'].isin(['one','three'])])

It actually works row-wise (i.e., applies the function to each row).

The output is

   A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14

The results is the same as using as mentioned by @unutbu

df[[df['B'].isin(['one','three'])]]

声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。