Question: How to filter a pandas DataFrame by multiple columns
To filter a dataframe (df) by a single column, say for data containing males and females, we might write:
males = df[df['Gender']=='Male']
Question 1: But what if the data spanned multiple years and I only wanted to see males for 2014?
In other languages I might do something like:
if A = "Male" and if B = "2014" then
(except that I want to do this and get a subset of the original dataframe in a new dataframe object)
Question 2: How do I do this in a loop, and create a dataframe object for each unique combination of year and gender (i.e. a df for each of 2013-Male, 2013-Female, 2014-Male, and 2014-Female)?
for y in year:
    for g in gender:
        df = .....
Answer 0
When using the & operator, don't forget to wrap the sub-statements in ():
males = df[(df['Gender']=='Male') & (df['Year']==2014)]
To store your dataframes in a dict using a for loop:
from collections import defaultdict
dic = {}
for g in ['Male', 'Female']:
    dic[g] = defaultdict(dict)
    for y in [2013, 2014]:
        dic[g][y] = df[(df['Gender']==g) & (df['Year']==y)]  # store the DataFrames in a dict of dicts
EDIT:
A demo for your getDF:
def getDF(dic, gender, year):
    return dic[gender][year]
print(getDF(dic, 'Male', 2014))
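As an alternative to the nested loops above, groupby can build one sub-dataframe per (year, gender) combination in a single pass. This is only a minimal sketch: the sample data and the names subsets and males_2014 are made up for illustration, and only combinations that actually occur in the data get a key.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    'Year':   [2013, 2013, 2014, 2014, 2014],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Value':  [1, 2, 3, 4, 5],
})

# groupby yields one (key, sub-DataFrame) pair per combination present in the data;
# with two grouping columns each key is a (year, gender) tuple.
subsets = {key: group for key, group in df.groupby(['Year', 'Gender'])}

males_2014 = subsets[(2014, 'Male')]
print(males_2014)
```

Compared with the dict of dicts, this needs no hard-coded list of years or genders, at the cost of using tuple keys.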
Answer 1
For more general boolean functions that you would like to use as a filter and that depend on more than one column, you can use:
df = df[df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)]
where f is a function that is applied to every pair of elements (x1, x2) from col_1 and col_2 and returns True or False depending on any condition you want on (x1, x2).
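To make the pattern above concrete, here is a small sketch with a made-up f. The sample data and the condition inside f are assumptions chosen for illustration. Note that a row-wise apply calls f once per row in Python, so it is usually much slower than a vectorized boolean mask and is best kept for conditions that cannot be vectorized.

```python
import pandas as pd

# Hypothetical sample data using the column names from the answer.
df = pd.DataFrame({
    'col_1': ['Male', 'Female', 'Male'],
    'col_2': [2014, 2014, 2013],
})

def f(x1, x2):
    # Illustrative condition only: keep males from 2014 onward.
    return x1 == 'Male' and x2 >= 2014

# axis=1 passes each row to the lambda; *x unpacks it into (x1, x2).
mask = df[['col_1', 'col_2']].apply(lambda x: f(*x), axis=1)
filtered = df[mask]
print(filtered)
```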
Answer 2
Starting with pandas 0.13, you can use query(), which is often the most efficient way, especially on large dataframes:
df.query('Gender=="Male" & Year=="2014" ')
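If the filter values live in Python variables, query can reference them with the @ prefix, which avoids building the expression string by hand. The sketch below assumes Year is stored as an integer; the function name males_in and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; Year is assumed to be an integer column.
df = pd.DataFrame({
    'Year':   [2013, 2014, 2014],
    'Gender': ['Male', 'Male', 'Female'],
})

def males_in(df, gender, year):
    # @gender and @year refer to the local variables of this function.
    return df.query('Gender == @gender and Year == @year')

result = males_in(df, 'Male', 2014)
print(result)
```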
Answer 3
In case somebody wonders which is the faster way to filter (the accepted answer or the one from @redreamality):
import pandas as pd
import numpy as np
length = 100_000
df = pd.DataFrame()
df['Year'] = np.random.randint(1950, 2019, size=length)
df['Gender'] = np.random.choice(['Male', 'Female'], length)
%timeit df.query('Gender=="Male" & Year=="2014" ')
%timeit df[(df['Gender']=='Male') & (df['Year']==2014)]
Results for 100,000 rows:
6.67 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.54 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Results for 10,000,000 rows:
326 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
472 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So the results depend on the size and the data. On my laptop, query() gets faster after 500k rows. Further, the string comparison in Year=="2014" has unnecessary overhead (Year==2014 is faster).
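The %timeit lines above only run inside IPython or Jupyter. A rough equivalent as a plain script, using the standard timeit module, might look like the sketch below. It also checks that both approaches select the same rows when Year is compared as an integer. The seed and the repeat count are arbitrary choices, and the timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
length = 100_000
df = pd.DataFrame({
    'Year': rng.integers(1950, 2019, size=length),
    'Gender': rng.choice(['Male', 'Female'], size=length),
})

def by_query():
    return df.query('Gender == "Male" and Year == 2014')

def by_mask():
    return df[(df['Gender'] == 'Male') & (df['Year'] == 2014)]

# Sanity check: both filters should return identical rows.
same = by_query().equals(by_mask())

# Total time in seconds for 20 calls of each approach.
t_query = timeit.timeit(by_query, number=20)
t_mask = timeit.timeit(by_mask, number=20)
print(f'same rows: {same}, query: {t_query:.3f}s, mask: {t_mask:.3f}s')
```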
Answer 4
You can create your own filter function using query in pandas. Here, df is filtered by all the kwargs parameters. Don't forget to add some validation (of the kwargs) to adapt the filter function to your own df.
def filter(df, **kwargs):
    # note: this name shadows Python's built-in filter
    query_list = []
    for key, value in kwargs.items():
        # every value is quoted, so it is compared as a string
        query_list.append(f'{key}=="{value}"')
    query = ' & '.join(query_list)
    return df.query(query)
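One possible shape for the validation mentioned above is sketched here. It is not the answer's method: it swaps the query string for a boolean mask so that values keep their original types, and it rejects keyword names that are not columns. The name filter_validated and the sample data are hypothetical.

```python
import pandas as pd

def filter_validated(df, **kwargs):
    """Return the rows of df where every given column equals the given value."""
    # Validation step: fail early on names that are not columns of df.
    unknown = [key for key in kwargs if key not in df.columns]
    if unknown:
        raise KeyError(f'Unknown column(s): {unknown}')
    # Start from an all-True mask and narrow it with one condition per column.
    mask = pd.Series(True, index=df.index)
    for key, value in kwargs.items():
        mask &= df[key] == value
    return df[mask]

# Hypothetical sample data.
df = pd.DataFrame({
    'Year':   [2013, 2014, 2014],
    'Gender': ['Male', 'Male', 'Female'],
})
print(filter_validated(df, Gender='Male', Year=2014))
```

Because values are compared directly rather than embedded in a string, Year=2014 matches an integer column without any quoting.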
Answer 5
You can filter by multiple columns (more than two) by using np.logical_and in place of the & operator (or np.logical_or in place of |).
Here's an example function that does the job, if you provide target values for multiple fields. You can adapt it for different types of filtering and whatnot:
def filter_df(df, filter_values):
    """Filter df by matching targets for multiple columns.
    Args:
        df (pd.DataFrame): dataframe
        filter_values (None or dict): Dictionary of the form:
            `{<field>: <target_values_list>}`
            used to filter columns data.
    """
    import numpy as np
    # nothing to filter on: return df unchanged
    if filter_values is None or not filter_values:
        return df
    # keep rows where every column's value is among its targets
    return df[
        np.logical_and.reduce([
            df[column].isin(target_values)
            for column, target_values in filter_values.items()
        ])
    ]
Usage:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 3, 4]})
filter_df(df, {
    'a': [1, 2, 3],
    'b': [1, 2, 4]
})
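To show the difference between the two reductions mentioned above, the sketch below applies np.logical_and.reduce and np.logical_or.reduce to the same list of per-column masks, reusing the usage data from the answer. The variable names masks, all_match and any_match are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 3, 4]})
filter_values = {'a': [1, 2, 3], 'b': [1, 2, 4]}

# One boolean mask per column: True where the value is among the targets.
masks = [df[column].isin(targets) for column, targets in filter_values.items()]

# Rows where every column matches (equivalent to chaining the masks with &).
all_match = df[np.logical_and.reduce(masks)]

# Rows where at least one column matches (equivalent to chaining with |).
any_match = df[np.logical_or.reduce(masks)]

print(all_match.index.tolist())  # [0, 1]
print(any_match.index.tolist())  # [0, 1, 2, 3]
```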