Question: Search for "does not contain" on a DataFrame in pandas
I've done some searching and can't figure out how to filter a dataframe by df["col"].str.contains(word). However, I'm wondering if there is a way to do the reverse: filter a dataframe by that set's complement, e.g. to the effect of !(df["col"].str.contains(word)).
Can this be done through a DataFrame method?
Answer 0
You can use the invert (~) operator (which acts like a not for boolean data):
new_df = df[~df["col"].str.contains(word)]
where new_df is the copy returned by the right-hand side.
contains also accepts a regular expression…
If the above throws a ValueError, the likely reason is that you have mixed datatypes, so use na=False:
new_df = df[~df["col"].str.contains(word, na=False)]
Or,
new_df = df[df["col"].str.contains(word) == False]
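As a minimal sketch of the inverted mask with na=False, here is a small made-up DataFrame (the sample values are assumptions for illustration, not from the question):

```python
import pandas as pd

# Hypothetical sample data; "col" and word mirror the names used above.
df = pd.DataFrame({"col": ["apple pie", "banana", None, "apple"]})
word = "apple"

# na=False makes missing values count as "does not contain", so the mask
# is purely boolean and can safely be inverted with ~.
mask = df["col"].str.contains(word, na=False)
new_df = df[~mask]

print(new_df)
```

With this data, the row with the missing value is kept, because it is treated as not containing the word.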
Answer 1
I was having trouble with the not (~) symbol as well, so here’s another way from another StackOverflow thread:
df[df["col"].str.contains('this|that')==False]
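The 'this|that' pattern works because str.contains treats its argument as a regular expression by default. A short sketch on made-up data, also showing regex=False for the case where the pipe character should be matched literally:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"col": ["this one", "that one", "other", "a|b"]})

# Default regex=True: "this|that" means "this" OR "that".
kept = df[~df["col"].str.contains("this|that")]

# regex=False treats the pattern as a literal substring, here the text "a|b".
kept_literal = df[~df["col"].str.contains("a|b", regex=False)]

print(kept)
print(kept_literal)
```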
Answer 2
You can use apply with a lambda to select rows whose column value is not in a list. Note that this tests whole-value equality against each list item, not substring containment. For your scenario:
df[df["col"].apply(lambda x: x not in [word1, word2, word3])]
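For whole-value matching like this, an inverted isin mask should give the same rows without a Python-level lambda. A sketch on made-up data (the column values and the excluded list are assumptions):

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"col": ["red", "green", "blue", "dark red"]})
excluded = ["red", "blue"]

# The apply/lambda form from the answer above.
by_apply = df[df["col"].apply(lambda x: x not in excluded)]

# Equivalent exact-match filter using isin and the ~ operator.
by_isin = df[~df["col"].isin(excluded)]

print(by_isin)
```

Note that "dark red" is kept in both cases, since neither form does substring matching.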
Answer 3
I had to get rid of the NULL values before using the command recommended by Andy above. An example:
import pandas as pd
df = pd.DataFrame(index=[0, 1, 2], columns=['first', 'second', 'third'])
# .loc replaces the .ix indexer, which has been removed from pandas
df.loc[:, 'first'] = 'myword'
df.loc[0, 'second'] = 'myword'
df.loc[2, 'second'] = 'myword'
df.loc[1, 'third'] = 'myword'
df
    first  second   third
0  myword  myword     NaN
1  myword     NaN  myword
2  myword  myword     NaN
Now running the command:
word = 'myword'  # assumed search term for this example
~df["second"].str.contains(word)
I get the following error:
TypeError: bad operand type for unary ~: 'float'
I got rid of the NULL values using dropna() or fillna() first, then retried the command with no problem.
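The two clean-up routes mentioned above could look roughly like this. It is a sketch on a one-column version of the example; filling with an empty string is an assumption about what a sensible placeholder is:

```python
import numpy as np
import pandas as pd

# One-column version of the example above.
df = pd.DataFrame({"second": ["myword", np.nan, "myword"]})
word = "myword"

# Option 1: replace missing values with an empty string before matching,
# so every element is a string and the result is a clean boolean mask.
mask_fill = ~df["second"].fillna("").str.contains(word)

# Option 2: drop the rows with missing values, then match on what is left.
no_nulls = df.dropna(subset=["second"])
mask_drop = ~no_nulls["second"].str.contains(word)

print(df[mask_fill])
print(no_nulls[mask_drop])
```

The difference is that fillna keeps the formerly missing row (it now counts as not containing the word), while dropna removes it from the result entirely.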
Answer 4
The main answers have already been posted; I am adding a pattern for searching for multiple words and excluding the matching rows from the DataFrame.
Here:
'word1', 'word2', 'word3', 'word4' = list of patterns to search for
df = the DataFrame
column_a = a column name from DataFrame df
Search_for_These_values = ['word1', 'word2', 'word3', 'word4']
pattern = '|'.join(Search_for_These_values)
result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
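Because the joined pattern is interpreted as a regular expression, search terms containing characters such as ( or + may need escaping. A sketch using re.escape from the standard library, with made-up data and terms:

```python
import re

import pandas as pd

# Hypothetical sample data and search terms for illustration.
df = pd.DataFrame({"column_a": ["Price (USD)", "c++ notes", "plain text"]})
search_terms = ["(usd)", "c++"]

# Escape each term so regex metacharacters are matched literally,
# then join the escaped terms with | to mean "any of these".
pattern = "|".join(re.escape(term) for term in search_terms)

# case=False makes the match case-insensitive, as in the answer above.
result = df.loc[~df["column_a"].str.contains(pattern, case=False)]

print(result)
```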
Answer 5
In addition to nanselm2's answer, you can use 0 instead of False:
df["col"].str.contains(word)==0
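One thing to keep in mind with the == 0 (or == False) form is how it treats missing values. The sketch below assumes an object-dtype column, where str.contains returns NaN for missing entries; the sample data is made up, and other string dtypes or pandas versions may behave differently:

```python
import numpy as np
import pandas as pd

# Hypothetical data; dtype=object is set explicitly so that missing values
# come back as NaN from str.contains.
col = pd.Series(["has word", np.nan, "nothing"], dtype=object)
df = pd.DataFrame({"col": col})
word = "word"

# NaN == 0 evaluates to False, so the row with the missing value is dropped.
eq_zero = df["col"].str.contains(word) == 0

# With na=False and ~, the row with the missing value is kept instead.
inverted = ~df["col"].str.contains(word, na=False)

print(df[eq_zero])
print(df[inverted])
```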