Question: How do I "select distinct" across multiple DataFrame columns in pandas?
I'm looking for a way to do the equivalent of the SQL
SELECT DISTINCT col1, col2 FROM dataframe_table
The pandas SQL comparison docs don't have anything about distinct.
.unique() only works for a single column, so I suppose I could concat the columns, or put them in a list/tuple and compare that way, but this seems like something pandas should do in a more native way.
Am I missing something obvious, or is there no way to do this?
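(For reference, a minimal sketch of the column-concatenation workaround mentioned above; the column names and sample data are made up:)

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 1], 'col2': ['x', 'y', 'x']})

# Join the columns into a single string key and deduplicate that --
# workable, but the distinct pairs come back as strings, not rows.
combined = df['col1'].astype(str) + '|' + df['col2'].astype(str)
print(combined.unique())  # ['1|x' '2|y']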
Answer 0
You can use the drop_duplicates method to get the unique rows in a DataFrame:
In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})
In [30]: df
Out[30]:
a b
0 1 3
1 2 4
2 1 3
3 2 5
In [32]: df.drop_duplicates()
Out[32]:
a b
0 1 3
1 2 4
3 2 5
You can also provide the subset keyword argument if you only want to use certain columns to determine uniqueness. See the docstring.
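For instance, a minimal sketch using the df above, judging uniqueness on column 'a' alone:

# Only column 'a' determines uniqueness; the first occurrence of each
# value is kept, so the rows at index 2 (a=1) and 3 (a=2) are dropped.
df.drop_duplicates(subset=['a'])
#    a  b
# 0  1  3
# 1  2  4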
Answer 1
I've tried different solutions. The first was:
import numpy as np
a_df = np.unique(df[['col1', 'col2']], axis=0)
and it works well for non-object data. Another way to do this, which avoids the error for object column dtypes, is to apply drop_duplicates():
a_df = df.drop_duplicates(['col1', 'col2'])[['col1', 'col2']]
You can also use SQL to do this, but it ran very slowly in my case:
from pandasql import sqldf

# pandasql runs the query against DataFrames found in the supplied
# namespace (here globals()), using an in-memory SQLite database.
q = """SELECT DISTINCT col1, col2 FROM df;"""
pysqldf = lambda q: sqldf(q, globals())
a_df = pysqldf(q)
Answer 2
There is no unique method for a DataFrame. If the number of unique values in each column were the same, then df.apply(pd.Series.unique) would work, but if not you will get an error. Another approach is to store the values in a dict keyed on the column name:
In [111]:
df = pd.DataFrame({'a':[0,1,2,2,4], 'b':[1,1,1,2,2]})
d = {}
for col in df:
    d[col] = df[col].unique()
d
Out[111]:
{'a': array([0, 1, 2, 4], dtype=int64), 'b': array([1, 2], dtype=int64)}
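The loop above can also be written as a dict comprehension, an equivalent one-liner:

# Map each column name to the array of its unique values.
d = {col: df[col].unique() for col in df}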
Answer 3
To solve a similar problem, I'm using groupby:
print(f"Distinct entries: {len(df.groupby(['col1', 'col2']))}")
Whether that's appropriate will depend on what you want to do with the result, though (in my case, I just wanted the equivalent of COUNT DISTINCT, as shown).
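If all you need is the count, a couple of equivalent spellings are possible (a sketch assuming the same col1/col2 column names; groupby drops NaN keys by default, so the two can disagree when the key columns contain NaN):

# Count distinct (col1, col2) pairs without materializing each group:
n_distinct = df.groupby(['col1', 'col2']).ngroups

# The same count via drop_duplicates:
n_distinct = len(df.drop_duplicates(['col1', 'col2']))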
Answer 5
You can take the sets of the columns and subtract one from the other. Note that this yields the values present in column a but not in column b (a set difference), not the distinct rows across both columns:
distinct_values = set(df['a']) - set(df['b'])
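For the multi-column DISTINCT asked about in the question, a set of row tuples is closer to the mark (a sketch assuming columns col1/col2):

# Each row becomes a (col1, col2) tuple; the set keeps one of each.
distinct_rows = set(zip(df['col1'], df['col2']))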