Tag Archives: filtering

Filtering items in a Python dictionary where keys contain a specific string

Question: Filtering items in a Python dictionary where keys contain a specific string

I’m a C coder developing something in python. I know how to do the following in C (and hence in C-like logic applied to python), but I’m wondering what the ‘Python’ way of doing it is.

I have a dictionary d, and I'd like to operate on a subset of the items: only those whose key (a string) contains a specific substring.

i.e. the C logic would be:

for key in d:
    if filter_string in key:
        # do something
    else:
        # do nothing, continue

I’m imagining the python version would be something like

filtered_dict = crazy_python_syntax(d, substring)
for key,value in filtered_dict.iteritems():
    # do something

I’ve found a lot of posts on here regarding filtering dictionaries, but couldn’t find one which involved exactly this.

My dictionary is not nested and i’m using python 2.7


Answer 0

How about a dict comprehension:

filtered_dict = {k:v for k,v in d.iteritems() if filter_string in k}

Once you see it, it should be self-explanatory, as it reads like English pretty well.

This syntax requires Python 2.7 or greater.

In Python 3, there is only dict.items(), not iteritems() so you would use:

filtered_dict = {k:v for (k,v) in d.items() if filter_string in k}
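
For illustration, a quick hypothetical run (the dict and substring here are invented, not from the question):

d = {'apple_pie': 1, 'banana_bread': 2, 'apple_cake': 3}
filter_string = 'apple'

filtered_dict = {k: v for k, v in d.iteritems() if filter_string in k}
# {'apple_pie': 1, 'apple_cake': 3}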

Answer 1

Go for whatever is most readable and easily maintainable. Just because you can write it out in a single line doesn't mean that you should. Your existing solution is close to what I would use, other than that I would use iteritems to skip the value lookup, and I hate nested ifs if I can avoid them:

for key, val in d.iteritems():
    if filter_string not in key:
        continue
    # do something

However, if you really want something to let you iterate through a filtered dict, then I would not do the two-step process of building the filtered dict and then iterating through it, but instead use a generator, because what is more pythonic (and awesome) than a generator?

First we create our generator, and good design dictates that we make it abstract enough to be reusable:

# The implementation of my generator may look vaguely familiar, no?
def filter_dict(d, filter_string):
    for key, val in d.iteritems():
        if filter_string not in key:
            continue
        yield key, val

And then we can use the generator to solve your problem nice and cleanly with simple, understandable code:

for key, val in filter_dict(d, some_string):
    # do something

In short: generators are awesome.


Answer 2

You can use the built-in filter function to filter dictionaries, lists, etc. based on specific conditions.

filtered_dict = dict(filter(lambda item: filter_str in item[0], d.items()))

The advantage is that you can use it for different data structures.
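
For illustration, a hypothetical dict and filter string (names invented here):

d = {'apple_pie': 1, 'banana_bread': 2, 'apple_cake': 3}
filter_str = 'apple'

filtered_dict = dict(filter(lambda item: filter_str in item[0], d.items()))
# {'apple_pie': 1, 'apple_cake': 3}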


Answer 3

input = {"A":"a", "B":"b", "C":"c"}
output = {k:v for (k,v) in input.items() if key_satifies_condition(k)}
input = {"A":"a", "B":"b", "C":"c"}
output = {k:v for (k,v) in input.items() if key_satifies_condition(k)}

Answer 4

Jonathon gave you an approach using dict comprehensions in his answer. Here is an approach that deals with your do something part.

If you want to do something with the values of the dictionary, you don’t need a dictionary comprehension at all:

I'm using iteritems() since you tagged your question with python-2.7.

results = map(some_function, [(k,v) for k,v in a_dict.iteritems() if 'foo' in k])

Now the results will be in a list, with some_function applied to each key/value pair of the dictionary whose key contains foo.

If you just want to deal with the values and ignore the keys, just change the list comprehension:

results = map(some_function, [v for k,v in a_dict.iteritems() if 'foo' in k])

some_function can be any callable, so a lambda would work as well:

results = map(lambda x: x*2, [v for k,v in a_dict.iteritems() if 'foo' in k])

The inner list is actually not required, as you can pass a generator expression to map as well:

>>> map(lambda a: a[0]*a[1], ((k,v) for k,v in {2:2, 3:2}.iteritems() if k == 2))
[4]

Filtering a pandas DataFrame on dates

Question: Filtering a pandas DataFrame on dates

I have a Pandas DataFrame with a ‘date’ column. Now I need to filter out all rows in the DataFrame that have dates outside of the next two months. Essentially, I only need to retain the rows that are within the next two months.

What is the best way to achieve this?


Answer 0

If the date column is the index, then use .loc for label-based indexing or .iloc for positional indexing.

For example:

df.loc['2014-01-01':'2014-02-01']

See details here http://pandas.pydata.org/pandas-docs/stable/dsintro.html#indexing-selection

If the column is not the index you have two choices:

  1. Make it the index (either temporarily, or permanently if it's time-series data), as sketched below
  2. df[(df['date'] > '2013-01-01') & (df['date'] < '2013-02-01')]

See here for the general explanation

Note: .ix is deprecated.
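
A minimal sketch of option 1, assuming a 'date' column that is already of datetime dtype (the bounds are placeholders):

df = df.set_index('date')                    # temporarily promote the column to the index
subset = df.loc['2013-01-01':'2013-02-01']   # label-based slicing on the DatetimeIndex
df = df.reset_index()                        # restore the original shape if needed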


Answer 1

In my experience, the previous answer is not correct: you can't pass it a simple string, it needs to be a datetime object. So:

import datetime 
df.loc[datetime.date(year=2014,month=1,day=1):datetime.date(year=2014,month=2,day=1)]

Answer 2

And if your dates are standardized by importing the datetime package, you can simply use:

df[(df['date']>datetime.date(2016,1,1)) & (df['date']<datetime.date(2016,3,1))]  

For standardizing your date strings using the datetime package, you can use this function:

import datetime
datetime.datetime.strptime
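
For example, a quick hypothetical conversion (the format string is an assumption about your data):

import datetime

d = datetime.datetime.strptime('2016-01-15', '%Y-%m-%d').date()
# d is datetime.date(2016, 1, 15), directly comparable against the bounds above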

Answer 3

If your datetime column has the Pandas datetime type (e.g. datetime64[ns]), then for proper filtering you need a pd.Timestamp object, for example:

from datetime import date

import pandas as pd

value_to_check = pd.Timestamp(date.today().year, 1, 1)
filter_mask = df['date_column'] < value_to_check
filtered_df = df[filter_mask]

Answer 4

If the dates are in the index then simply:

df['20160101':'20160301']

Answer 5

You can use pd.Timestamp to perform a query, with a local reference:

import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame()
ts = pd.Timestamp

df['date'] = np.array(np.arange(10) + datetime.now().timestamp(), dtype='M8[s]')

print(df)
print(df.query('date > @ts("20190515T071320")'))

with the output

                 date
0 2019-05-15 07:13:16
1 2019-05-15 07:13:17
2 2019-05-15 07:13:18
3 2019-05-15 07:13:19
4 2019-05-15 07:13:20
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25


                 date
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25

Have a look at the pandas documentation for DataFrame.query, specifically the mention of local variables referenced using the @ prefix. In this case we reference pd.Timestamp using the local alias ts to be able to supply a timestamp string.


Answer 6

So when loading the csv data file, we need to set the date column as the index, as below, in order to filter data based on a range of dates. This was not needed with the now-deprecated method pd.DataFrame.from_csv().

If you just want to show the data for two months from Jan to Feb, e.g. 2020-01-01 to 2020-02-29, you can do so:

import pandas as pd
mydata = pd.read_csv('mydata.csv', index_col='date')  # or its index number, e.g. index_col=[0]
mydata['2020-01-01':'2020-02-29']  # will pull all the columns
# if you need just one column, e.g. Cost, use .loc:
mydata.loc['2020-01-01':'2020-02-29', 'Cost']

This has been tested and works on Python 3.7. Hope you find this useful.
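
One caveat, as a hedge: for the string slice above to work, the index generally needs to be a DatetimeIndex rather than plain strings, which read_csv can produce directly:

import pandas as pd

# parse_dates=True asks read_csv to parse the index column into a DatetimeIndex
mydata = pd.read_csv('mydata.csv', index_col='date', parse_dates=True)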


Answer 7

How about using pyjanitor?

It has cool features.

After pip install pyjanitor

import janitor

df_filtered = df.filter_date(your_date_column_name, start_date, end_date)

Answer 8

The shortest way to filter your dataframe by date: let's suppose your date column is of type datetime64[ns]:

# filter by single day
df = df[df['date'].dt.strftime('%Y-%m-%d') == '2014-01-01']

# filter by single month
df = df[df['date'].dt.strftime('%Y-%m') == '2014-01']

# filter by single year
df = df[df['date'].dt.strftime('%Y') == '2014']
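
Note that strftime formats every row into a string; if you are only comparing year or month components, the .dt accessors avoid the string round-trip (a sketch, not part of the original answer):

# filter by single year without formatting strings
df = df[df['date'].dt.year == 2014]

# filter by a single month of a given year
df = df[(df['date'].dt.year == 2014) & (df['date'].dt.month == 1)]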

Answer 9

I'm not allowed to write any comments yet, so I'll write an answer, in case somebody reads all of them and reaches this one.

If the index of the dataset is a datetime and you want to filter that just by (for example) months, you can do following:

df.loc[df.index.month == 3]

That will filter the dataset for you by March.


Answer 10

You could just select the time range by doing: df.loc['start_date':'end_date']


Answer 11

If you have already converted the string to a date format using pd.to_datetime you can just use:

df = df[(df['Date'] > "2018-01-01") & (df['Date'] < "2019-07-01")]


Logical operators for boolean indexing in pandas

Question: Logical operators for boolean indexing in pandas

I'm working with a boolean index in pandas. The question is why the statement:

a[(a['some_column']==some_number) & (a['some_other_column']==some_other_number)]

works fine whereas

a[(a['some_column']==some_number) and (a['some_other_column']==some_other_number)]

exits with an error?

Example:

a=pd.DataFrame({'x':[1,1],'y':[10,20]})

In: a[(a['x']==1)&(a['y']==10)]
Out:    x   y
     0  1  10

In: a[(a['x']==1) and (a['y']==10)]
Out: ValueError: The truth value of an array with more than one element is ambiguous.     Use a.any() or a.all()

Answer 0

When you say

(a['x']==1) and (a['y']==10)

you are implicitly asking Python to convert (a['x']==1) and (a['y']==10) to boolean values.

NumPy arrays (of length greater than 1) and Pandas objects such as Series do not have a boolean value — in other words, they raise

ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

when used as a boolean value. That's because it's unclear when they should be True or False. Some users might assume they are True if they have non-zero length, like a Python list. Others might want it to be True only if all its elements are True. Others might want it to be True if any of its elements are True.

Because there are so many conflicting expectations, the designers of NumPy and Pandas refuse to guess, and instead raise a ValueError.

Instead, you must be explicit, by calling the empty(), all() or any() method to indicate which behavior you desire.

In this case, however, it looks like you do not want boolean evaluation, you want element-wise logical-and. That is what the & binary operator performs:

(a['x']==1) & (a['y']==10)

returns a boolean array.


By the way, as alexpmil notes, the parentheses are mandatory since & has a higher operator precedence than ==. Without the parentheses, a['x']==1 & a['y']==10 would be evaluated as a['x'] == (1 & a['y']) == 10 which would in turn be equivalent to the chained comparison (a['x'] == (1 & a['y'])) and ((1 & a['y']) == 10). That is an expression of the form Series and Series. The use of and with two Series would again trigger the same ValueError as above. That’s why the parentheses are mandatory.
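
A quick sketch of the failure mode, using the DataFrame from the question:

import pandas as pd

a = pd.DataFrame({'x': [1, 1], 'y': [10, 20]})

a[(a['x'] == 1) & (a['y'] == 10)]  # works: element-wise AND of two boolean Series
# a[a['x'] == 1 & a['y'] == 10]    # ValueError: parsed as a['x'] == (1 & a['y']) == 10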


Answer 1

TL;DR: the logical operators in pandas are &, | and ~, and parentheses (...) are important!

Python's and, or and not logical operators are designed to work with scalars. So pandas had to do one better and override the bitwise operators to achieve a vectorized (element-wise) version of this functionality.

So the following in python (exp1 and exp2 are expressions which evaluate to a boolean result)…

exp1 and exp2              # Logical AND
exp1 or exp2               # Logical OR
not exp1                   # Logical NOT

…will translate to…

exp1 & exp2                # Element-wise logical AND
exp1 | exp2                # Element-wise logical OR
~exp1                      # Element-wise logical NOT

for pandas.

If, in the process of performing a logical operation, you get a ValueError, then you need to use parentheses for grouping:

(exp1) op (exp2)

For example,

(df['col1'] == x) & (df['col2'] == y) 

And so on.


Boolean Indexing: A common operation is to compute boolean masks through logical conditions to filter the data. Pandas provides three operators: & for logical AND, | for logical OR, and ~ for logical NOT.

Consider the following setup:

np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (5, 3)), columns=list('ABC'))
df

   A  B  C
0  5  0  3
1  3  7  9
2  3  5  2
3  4  7  6
4  8  8  1

Logical AND

For df above, say you’d like to return all rows where A < 5 and B > 5. This is done by computing masks for each condition separately, and ANDing them.

Overloaded Bitwise & Operator
Before continuing, please take note of this particular excerpt of the docs, which states

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df.A > 2 & df.B < 3 as df.A > (2 & df.B) < 3, while the desired evaluation order is (df.A > 2) & (df.B < 3).

So, with this in mind, element wise logical AND can be implemented with the bitwise operator &:

df['A'] < 5

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df['B'] > 5

0    False
1     True
2    False
3     True
4     True
Name: B, dtype: bool

(df['A'] < 5) & (df['B'] > 5)

0    False
1     True
2    False
3     True
4    False
dtype: bool

And the subsequent filtering step is simply,

df[(df['A'] < 5) & (df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

The parentheses are used to override the default precedence order of bitwise operators, which have a higher precedence than the comparison operators < and >. See the section on Operator Precedence in the python docs.

If you do not use parentheses, the expression is evaluated incorrectly. For example, if you accidentally attempt something such as

df['A'] < 5 & df['B'] > 5

It is parsed as

df['A'] < (5 & df['B']) > 5

Which becomes,

df['A'] < something_you_dont_want > 5

Which becomes (see the python docs on chained operator comparison),

(df['A'] < something_you_dont_want) and (something_you_dont_want > 5)

Which becomes,

# Both operands are Series...
something_else_you_dont_want1 and something_else_you_dont_want2

Which throws

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

So, don't make that mistake! [1]

Avoiding Parentheses Grouping
The fix is actually quite simple. Most operators have a corresponding bound method for DataFrames. If the individual masks are built up using functions instead of conditional operators, you will no longer need to group by parens to specify evaluation order:

df['A'].lt(5)

0     True
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df['B'].gt(5)

0    False
1     True
2    False
3     True
4     True
Name: B, dtype: bool

df['A'].lt(5) & df['B'].gt(5)

0    False
1     True
2    False
3     True
4    False
dtype: bool

See the section on Flexible Comparisons. To summarise, we have

╒════╤════════════╤════════════╕
│    │ Operator   │ Function   │
╞════╪════════════╪════════════╡
│  0 │ >          │ gt         │
├────┼────────────┼────────────┤
│  1 │ >=         │ ge         │
├────┼────────────┼────────────┤
│  2 │ <          │ lt         │
├────┼────────────┼────────────┤
│  3 │ <=         │ le         │
├────┼────────────┼────────────┤
│  4 │ ==         │ eq         │
├────┼────────────┼────────────┤
│  5 │ !=         │ ne         │
╘════╧════════════╧════════════╛

Another option for avoiding parentheses is to use DataFrame.query (or eval):

df.query('A < 5 and B > 5')

   A  B  C
1  3  7  9
3  4  7  6

I have extensively documented query and eval in Dynamic Expression Evaluation in pandas using pd.eval().

operator.and_
Allows you to perform this operation in a functional manner. Internally calls Series.__and__ which corresponds to the bitwise operator.

import operator 

operator.and_(df['A'] < 5, df['B'] > 5)
# Same as,
# (df['A'] < 5).__and__(df['B'] > 5) 

0    False
1     True
2    False
3     True
4    False
dtype: bool

df[operator.and_(df['A'] < 5, df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

You won’t usually need this, but it is useful to know.

Generalizing: np.logical_and (and logical_and.reduce)
Another alternative is using np.logical_and, which also does not need parentheses grouping:

np.logical_and(df['A'] < 5, df['B'] > 5)

0    False
1     True
2    False
3     True
4    False
Name: A, dtype: bool

df[np.logical_and(df['A'] < 5, df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

np.logical_and is a ufunc (universal function), and most ufuncs have a reduce method. This means it is easier to generalise with logical_and if you have multiple masks to AND. For example, to AND masks m1, m2 and m3 with &, you would have to do

m1 & m2 & m3

However, an easier option is

np.logical_and.reduce([m1, m2, m3])

This is powerful, because it lets you build on top of this with more complex logic (for example, dynamically generating masks in a list comprehension and adding all of them):

import operator

cols = ['A', 'B']
ops = [np.less, np.greater]
values = [5, 5]

m = np.logical_and.reduce([op(df[c], v) for op, c, v in zip(ops, cols, values)])
m 
# array([False,  True, False,  True, False])

df[m]
   A  B  C
1  3  7  9
3  4  7  6

[1] I know I'm harping on this point, but please bear with me. This is a very, very common beginner's mistake, and must be explained very thoroughly.


Logical OR

For the df above, say you’d like to return all rows where A == 3 or B == 7.

Overloaded Bitwise |

df['A'] == 3

0    False
1     True
2     True
3    False
4    False
Name: A, dtype: bool

df['B'] == 7

0    False
1     True
2    False
3     True
4    False
Name: B, dtype: bool

(df['A'] == 3) | (df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
dtype: bool

df[(df['A'] == 3) | (df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

If you haven't yet, please also read the section on Logical AND above; all caveats apply here.

Alternatively, this operation can be specified with

df[df['A'].eq(3) | df['B'].eq(7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

operator.or_
Calls Series.__or__ under the hood.

operator.or_(df['A'] == 3, df['B'] == 7)
# Same as,
# (df['A'] == 3).__or__(df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
dtype: bool

df[operator.or_(df['A'] == 3, df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

np.logical_or
For two conditions, use logical_or:

np.logical_or(df['A'] == 3, df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df[np.logical_or(df['A'] == 3, df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

For multiple masks, use logical_or.reduce:

np.logical_or.reduce([df['A'] == 3, df['B'] == 7])
# array([False,  True,  True,  True, False])

df[np.logical_or.reduce([df['A'] == 3, df['B'] == 7])]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

Logical NOT

Given a mask, such as

mask = pd.Series([True, True, False])

If you need to invert every boolean value (so that the end result is [False, False, True]), then you can use any of the methods below.

Bitwise ~

~mask

0    False
1    False
2     True
dtype: bool

Again, expressions need to be parenthesised.

~(df['A'] == 3)

0     True
1    False
2    False
3     True
4     True
Name: A, dtype: bool

This internally calls

mask.__invert__()

0    False
1    False
2     True
dtype: bool

But don’t use it directly.

operator.inv
Internally calls __invert__ on the Series.

operator.inv(mask)

0    False
1    False
2     True
dtype: bool

np.logical_not
This is the numpy variant.

np.logical_not(mask)

0    False
1    False
2     True
dtype: bool

Note: for boolean inputs, np.logical_and can be substituted with np.bitwise_and, logical_or with bitwise_or, and logical_not with invert.
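
A small sketch of that equivalence on boolean masks (the data here is invented):

import numpy as np
import pandas as pd

m1 = pd.Series([True, False, True])
m2 = pd.Series([True, True, False])

# on boolean inputs, the logical and bitwise ufuncs agree element-wise
assert (np.logical_and(m1, m2) == np.bitwise_and(m1, m2)).all()
assert (np.logical_not(m1) == np.invert(m1)).all()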


Answer 2

Logical operators for boolean indexing in Pandas

It's important to realize that you cannot use any of the Python logical operators (and, or, or not) on pandas.Series or pandas.DataFrames (similarly, you cannot use them on numpy.arrays with more than one element). The reason you cannot use them is that they implicitly call bool on their operands, which throws an exception, because these data structures decided that the truth value of an array is ambiguous:

>>> import numpy as np
>>> import pandas as pd
>>> arr = np.array([1,2,3])
>>> s = pd.Series([1,2,3])
>>> df = pd.DataFrame([1,2,3])
>>> bool(arr)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
>>> bool(s)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> bool(df)
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I did cover this more extensively in my answer to the “Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()” Q+A.

NumPy's logical functions

However, NumPy provides element-wise equivalents of these operators as functions that can be used on numpy.array, pandas.Series, pandas.DataFrame, or any other (conforming) numpy.array subclass:

So, essentially, one should use (assuming df1 and df2 are pandas DataFrames):

np.logical_and(df1, df2)
np.logical_or(df1, df2)
np.logical_not(df1)
np.logical_xor(df1, df2)
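
For instance, a sketch on two assumed boolean Series:

import numpy as np
import pandas as pd

x = pd.Series([True, False, True])
y = pd.Series([True, True, False])

np.logical_and(x, y)  # element-wise: [True, False, False]
np.logical_or(x, y)   # element-wise: [True, True, True]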

Bitwise functions and bitwise operators for booleans

However, if you have a boolean NumPy array, a pandas Series, or a pandas DataFrame, you could also use the element-wise bitwise functions (for booleans they are – or at least should be – indistinguishable from the logical functions):

Typically the operators are used. However, when combined with comparison operators, one has to remember to wrap each comparison in parentheses, because the bitwise operators have a higher precedence than the comparison operators:

(df1 < 10) | (df2 > 10)  # instead of the wrong df1 < 10 | df2 > 10

This may be irritating because the Python logical operators have a lower precedence than the comparison operators, so you normally write a < 10 and b > 10 (where a and b are, for example, simple integers) and don't need the parentheses.

Differences between logical and bitwise operations (on non-booleans)

It is really important to stress that bitwise and logical operations are equivalent only for boolean NumPy arrays (and boolean Series & DataFrames). If these don't contain booleans, then the operations will give different results. I'll include examples using NumPy arrays, but the results will be similar for the pandas data structures:

>>> import numpy as np
>>> a1 = np.array([0, 0, 1, 1])
>>> a2 = np.array([0, 1, 0, 1])

>>> np.logical_and(a1, a2)
array([False, False, False,  True])
>>> np.bitwise_and(a1, a2)
array([0, 0, 0, 1], dtype=int32)

And since NumPy (and similarly pandas) does different things for boolean ("mask" index arrays) and integer (index arrays) indices, the results of indexing will also be different:

>>> a3 = np.array([1, 2, 3, 4])

>>> a3[np.logical_and(a1, a2)]
array([4])
>>> a3[np.bitwise_and(a1, a2)]
array([1, 1, 1, 2])

Summary table

Logical operator | NumPy logical function | NumPy bitwise function | Bitwise operator
-------------------------------------------------------------------------------------
       and       |  np.logical_and        | np.bitwise_and         |        &
-------------------------------------------------------------------------------------
       or        |  np.logical_or         | np.bitwise_or          |        |
-------------------------------------------------------------------------------------
                 |  np.logical_xor        | np.bitwise_xor         |        ^
-------------------------------------------------------------------------------------
       not       |  np.logical_not        | np.invert              |        ~

In the table, the plain logical operators do not work for NumPy arrays, pandas Series, or pandas DataFrames; the others work on these data structures (and plain Python objects) and operate element-wise. However, be careful with bitwise invert on plain Python bools, because a bool is interpreted as an integer in this context (for example, ~False returns -1 and ~True returns -2).
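
A quick illustration of that last caveat:

>>> ~False, ~True  # plain Python bools are treated as the integers 0 and 1 here
(-1, -2)
>>> import numpy as np
>>> np.logical_not(False), np.invert(np.True_)
(True, False)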


Detect and exclude outliers in a pandas DataFrame

Question: Detect and exclude outliers in a pandas DataFrame

I have a pandas data frame with a few columns.

Now I know that certain rows are outliers based on a certain column value.

For instance

column 'Vol' has all values around 12xx and one value which is 4000 (an outlier).

Now I would like to exclude those rows where the Vol column contains such an outlier.

So, essentially I need to put a filter on the data frame such that we select all rows where the values of a certain column are within, say, 3 standard deviations from the mean.

What is an elegant way to achieve this?


Answer 0

If you have multiple columns in your dataframe and would like to remove all rows that have outliers in at least one column, the following expression would do that in one shot.

import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(np.random.randn(100, 3))
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

Description:

  • For each column, it first computes the Z-score of each value in the column, relative to the column mean and standard deviation.
  • Then it takes the absolute value of the Z-score, because the direction does not matter, only whether it is below the threshold.
  • all(axis=1) ensures that, for each row, all columns satisfy the constraint.
  • Finally, the result of this condition is used to index the dataframe.

Answer 1

Use boolean indexing as you would in numpy.array:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Data': np.random.normal(size=200)})
# example dataset of normally distributed data.

df[np.abs(df.Data - df.Data.mean()) <= (3 * df.Data.std())]
# keep only the ones that are within +3 to -3 standard deviations in the column 'Data'.

df[~(np.abs(df.Data - df.Data.mean()) > (3 * df.Data.std()))]
# or if you prefer the other way around

For a series it is similar:

S = pd.Series(np.random.normal(size=200))
S[~((S-S.mean()).abs() > 3*S.std())]

Answer 2

For each of your dataframe columns, you could get the quantile with:

q = df["col"].quantile(0.99)

and then filter with:

df[df["col"] < q]

If you need to remove lower and upper outliers, combine the conditions with an AND statement:

q_low = df["col"].quantile(0.01)
q_hi  = df["col"].quantile(0.99)

df_filtered = df[(df["col"] < q_hi) & (df["col"] > q_low)]

Answer 3

This answer is similar to that provided by @tanemaki, but uses a lambda expression instead of scipy stats.

df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))

df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]

To filter the DataFrame where only ONE column (e.g. 'B') is within three standard deviations:

df[((df.B - df.B.mean()) / df.B.std()).abs() < 3]

See here for how to apply this z-score on a rolling basis: Rolling Z-score applied to pandas dataframe


Answer 4

#------------------------------------------------------------------------------
# accept a dataframe, remove outliers, return cleaned data in a new dataframe
# see http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
#------------------------------------------------------------------------------
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
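
A hypothetical usage, assuming a DataFrame df with the question's 'Vol' column:

df_clean = remove_outlier(df, 'Vol')  # rows outside Tukey's 1.5*IQR fences are dropped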

Answer 5

For each series in the dataframe, you could use between and quantile to remove outliers.

x = pd.Series(np.random.normal(size=200)) # with outliers
x = x[x.between(x.quantile(.25), x.quantile(.75))]  # keeps only the interquartile range; widen the quantiles (e.g. .01/.99) to trim less

Answer 6

Since I haven't seen an answer that deals with both numerical and non-numerical attributes, here is a complementary answer.

You might want to drop the outliers only on numerical attributes (categorical variables can hardly be outliers).

Function definition

I have extended @tanemaki’s suggestion to handle data when non-numeric attributes are also present:

import numpy as np
from scipy import stats

def drop_numerical_outliers(df, z_thresh=3):
    # constrains will contain `True` or `False` depending on whether the value is below the threshold
    constrains = df.select_dtypes(include=[np.number]) \
        .apply(lambda x: np.abs(stats.zscore(x)) < z_thresh) \
        .all(axis=1)
    # Drop (in place) the rows flagged for rejection
    df.drop(df.index[~constrains], inplace=True)

Usage

drop_numerical_outliers(df)

Example

Imagine a dataset df with some values about houses: alley, land contour, sale price, … E.g: Data Documentation

First, you want to visualise the data on a scatter graph (with z-score Thresh=3):

# Plot data before dropping those greater than z-score 3.
# The scatterAreaVsPrice function's definition has been removed for readability's sake.
scatterAreaVsPrice(df)

# Drop the outliers on every attribute
drop_numerical_outliers(df)

# Plot the result. All outliers were dropped. Note that the red points are not
# the same outliers from the first plot, but the newly computed outliers based on the new data-frame.
scatterAreaVsPrice(df)


Answer 7

scipy.stats has methods trim1() and trimboth() to cut the outliers out in a single line, according to rank and an introduced percentage of removed values.
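
A sketch of trimboth on a plain array (note it returns a trimmed copy of the values, not a row filter, and the order of the returned values is not guaranteed):

import numpy as np
from scipy import stats

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])
trimmed = stats.trimboth(arr, 0.1)  # slice off 10% from each end
# trimmed holds 8 values, with the extremes (1 and 1000) removed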


Answer 8

Another option is to transform your data so that the effect of outliers is mitigated. You can do this by winsorizing your data.

import pandas as pd
from scipy.stats import mstats
%matplotlib inline

test_data = pd.Series(range(30))
test_data.plot()

# Truncate values to the 5th and 95th percentiles
transformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05])) 
transformed_test_data.plot()


Answer 9

If you like method chaining, you can get your boolean condition for all numeric columns like this:

df.sub(df.mean()).div(df.std()).abs().lt(3)

Each value of each column will be converted to True/False based on whether it's less than three standard deviations away from the mean or not.
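
To turn that condition into a row filter, a minimal sketch:

# keep rows where every numeric column is within 3 standard deviations of its mean
df[df.sub(df.mean()).div(df.std()).abs().lt(3).all(axis=1)]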


Answer 10

You can use a boolean mask:

import pandas as pd

def remove_outliers(df, q=0.05):
    upper = df.quantile(1-q)
    lower = df.quantile(q)
    mask = (df < upper) & (df > lower)
    return mask

t = pd.DataFrame({'train': [1,1,2,3,4,5,6,7,8,9,9],
                  'y': [1,0,0,1,1,0,0,1,1,1,0]})

mask = remove_outliers(t['train'], 0.1)

print(t[mask])

output:

   train  y
2      2  0
3      3  1
4      4  1
5      5  0
6      6  0
7      7  1
8      8  1

Answer 11

Since I am at a very early stage of my data science journey, I am treating outliers with the code below.

# Outlier Treatment
import numpy as np

def outlier_detect(df):
    for i in df.describe().columns:
        Q1 = df.describe().at['25%', i]
        Q3 = df.describe().at['75%', i]
        IQR = Q3 - Q1
        LTV = Q1 - 1.5 * IQR  # lower fence
        UTV = Q3 + 1.5 * IQR  # upper fence
        x = np.array(df[i])
        p = []
        for j in x:
            if j < LTV or j > UTV:
                p.append(df[i].median())  # replace outliers with the column median
            else:
                p.append(j)
        df[i] = p
    return df
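
The same median-replacement idea can be written without the inner loop; a vectorized sketch (my rewording, not the author's code):

import pandas as pd

def outlier_replace_median(df):
    for col in df.select_dtypes('number').columns:
        s = df[col]
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
        df[col] = s.mask(mask, s.median())  # replace flagged values with the column median
    return df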

Answer 12

Get the 98th and 2nd percentiles as the limits of our outliers:

upper_limit = np.percentile(X_train.logerror.values, 98)
lower_limit = np.percentile(X_train.logerror.values, 2)

# Filter the outliers from the dataframe
data['target'].loc[X_train['target'] > upper_limit] = upper_limit
data['target'].loc[X_train['target'] < lower_limit] = lower_limit

Answer 13

A full example with data and 2 groups follows:

Imports:

from StringIO import StringIO
import pandas as pd
#pandas config
pd.set_option('display.max_rows', 20)

Data example with 2 groups: G1:Group 1. G2: Group 2:

TESTDATA = StringIO("""G1;G2;Value
1;A;1.6
1;A;5.1
1;A;7.1
1;A;8.1

1;B;21.1
1;B;22.1
1;B;24.1
1;B;30.6

2;A;40.6
2;A;51.1
2;A;52.1
2;A;60.6

2;B;80.1
2;B;70.6
2;B;90.6
2;B;85.1
""")

Read the text data into a pandas dataframe:

df = pd.read_csv(TESTDATA, sep=";")

Define the outliers using standard deviations

stds = 1.0
outliers = df[['G1', 'G2', 'Value']].groupby(['G1','G2']).transform(
           lambda group: (group - group.mean()).abs().div(group.std())) > stds

Define filtered data values and the outliers:

dfv = df[outliers.Value == False]
dfo = df[outliers.Value == True]

Print the result:

print '\n'*5, 'All values with decimal 1 are non-outliers. On the other hand, all values with 6 in the decimal are.'
print '\nDef DATA:\n%s\n\nFiltered Values with %s stds:\n%s\n\nOutliers:\n%s' % (df, stds, dfv, dfo)

Answer 14

My function for dropping outliers

import numpy as np

def drop_outliers(df, field_name):
    distance = 1.5 * (np.percentile(df[field_name], 75) - np.percentile(df[field_name], 25))
    df.drop(df[df[field_name] > distance + np.percentile(df[field_name], 75)].index, inplace=True)
    df.drop(df[df[field_name] < np.percentile(df[field_name], 25) - distance].index, inplace=True)

Answer 15

I prefer to clip rather than drop. The following will clip in place at the 2nd and 98th percentiles.

df_list = list(df)
minPercentile = 0.02
maxPercentile = 0.98

for col in df_list:
    df[col] = df[col].clip(df[col].quantile(minPercentile), df[col].quantile(maxPercentile))

Answer 16

Deleting and dropping outliers is, I believe, statistically wrong: it makes the data different from the original data, and unevenly shaped. The best way is instead to reduce or avoid the effect of outliers by log-transforming the data. This worked for me:

np.log(data.iloc[:, :])

The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

Question: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

I'm having an issue filtering my result dataframe with an or condition. I want my result df to extract all column var values that are above 0.25 or below -0.25.

The logic below gives me an ambiguous truth value; however, it works when I split this filtering into two separate operations. What is happening here? I'm not sure where to use the suggested a.empty(), a.bool(), a.item(), a.any() or a.all().

result = result[(result['var'] > 0.25) or (result['var'] < -0.25)]

Answer 0

The or and and Python statements require truth values. For pandas these are considered ambiguous, so you should use the “bitwise” | (or) or & (and) operations:

result = result[(result['var']>0.25) | (result['var']<-0.25)]

These are overloaded for these kinds of data structures to yield the element-wise or (or and).


Just to add some more explanation to this statement:

The exception is thrown when you want to get the bool of a pandas.Series:

>>> import pandas as pd
>>> x = pd.Series([1])
>>> bool(x)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

What you hit was a place where the operator implicitly converted the operands to bool (you used or, but it also happens for and, if and while):

>>> x or x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> x and x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> if x:
...     print('fun')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> while x:
...     print('fun')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Besides these 4 statements, there are several Python functions that hide bool calls (like any, all, filter, …). These are normally not problematic with pandas.Series, but for completeness I wanted to mention them.


In your case the exception isn’t really helpful, because it doesn’t mention the right alternatives. For and and or you can use (if you want element-wise comparisons):

  • numpy.logical_or:

    >>> import numpy as np
    >>> np.logical_or(x, y)
    

    or simply the | operator:

    >>> x | y
    
  • numpy.logical_and:

    >>> np.logical_and(x, y)
    

    or simply the & operator:

    >>> x & y
    

If you’re using the operators, make sure you set your parentheses correctly because of operator precedence.

There are several logical numpy functions which should work on pandas.Series.


The alternatives mentioned in the exception are better suited if you encountered it when doing if or while. I'll briefly explain each of these:

  • If you want to check if your Series is empty:

    >>> x = pd.Series([])
    >>> x.empty
    True
    >>> x = pd.Series([1])
    >>> x.empty
    False
    

    Python normally interprets the length of containers (like list, tuple, …) as a truth value if they have no explicit boolean interpretation. So if you want the Python-like check, you could do if x.size or if not x.empty instead of if x.

  • If your Series contains one and only one boolean value:

    >>> x = pd.Series([100])
    >>> (x > 50).bool()
    True
    >>> (x < 50).bool()
    False
    
  • If you want to check the first and only item of your Series (like .bool(), but it works even for non-boolean contents):

    >>> x = pd.Series([100])
    >>> x.item()
    100
    
  • If you want to check if all or any item is not-zero, not-empty or not-False:

    >>> x = pd.Series([0, 1, 2])
    >>> x.all()   # because one element is zero
    False
    >>> x.any()   # because one (or more) elements are non-zero
    True
    

Answer 1

For boolean logic, use & and |.

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))

>>> df
          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
2  0.950088 -0.151357 -0.103219
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

>>> df.loc[(df.C > 0.25) | (df.C < -0.25)]
          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

To see what is happening, you get a column of booleans for each comparison, e.g.

df.C > 0.25
0     True
1    False
2    False
3     True
4     True
Name: C, dtype: bool

When you have multiple criteria, you will get multiple columns returned. This is why the join logic is ambiguous. Using and or or treats each column separately, so you first need to reduce that column to a single boolean value. For example, to see if any value or all values in each of the columns is True.

# Any value in either column is True?
(df.C > 0.25).any() or (df.C < -0.25).any()
True

# All values in either column is True?
(df.C > 0.25).all() or (df.C < -0.25).all()
False

One convoluted way to achieve the same thing is to zip all of these columns together, and perform the appropriate logic.

>>> df[[any([a, b]) for a, b in zip(df.C > 0.25, df.C < -0.25)]]
          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

For more details, refer to Boolean Indexing in the docs.


Answer 2

Well, pandas uses the bitwise operators & and |, and each condition should be wrapped in ().

For example, the following works:

data_query = data[(data['year'] >= 2005) & (data['year'] <= 2010)]

But the same query without proper brackets does not. Because & binds more tightly than the comparisons, it parses as data['year'] >= (2005 & data['year']) <= 2010, a chained comparison whose implicit and again raises the ambiguous truth-value error:

data_query = data[(data['year'] >= 2005 & data['year'] <= 2010)]

Answer 3

Or, alternatively, you could use the operator module. More detailed information is in the Python docs.

import operator
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df.loc[operator.or_(df.C > 0.25, df.C < -0.25)]

          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

Answer 4

This excellent answer explains very well what is happening and provides a solution. I would like to add another solution that might be suitable in similar cases: using the query method:

result = result.query("(var > 0.25) or (var < -0.25)")

See also http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-query.

(Some tests with a dataframe I’m currently working with suggest that this method is a bit slower than using the bitwise operators on series of booleans: 2 ms vs. 870 µs)

A word of warning: at least one situation where this is not straightforward is when column names happen to be Python expressions. I had columns named WT_38hph_IP_2, WT_38hph_input_2 and log2(WT_38hph_IP_2/WT_38hph_input_2) and wanted to perform the following query: "(log2(WT_38hph_IP_2/WT_38hph_input_2) > 1) and (WT_38hph_IP_2 > 20)"

I obtained the following exception cascade:

  • KeyError: 'log2'
  • UndefinedVariableError: name 'log2' is not defined
  • ValueError: "log2" is not a supported function

I guess this happened because the query parser was trying to make something from the first two columns instead of identifying the expression with the name of the third column.

A possible workaround is proposed here.


Answer 5

I encountered the same error and was stalled with a PySpark dataframe for a few days. I was able to resolve it by filling the na values with 0, since I was comparing integer values from two fields.