Tag archive: contains

Search for "does not contain" on a DataFrame in pandas

Question: Search for "does not contain" on a DataFrame in pandas

I’ve done some searching and can’t figure out how to filter a dataframe by df["col"].str.contains(word); however, I’m wondering if there is a way to do the reverse: filter a dataframe by that set’s complement, e.g. to the effect of !(df["col"].str.contains(word)).

Can this be done through a DataFrame method?


Answer 0

You can use the invert (~) operator (which acts like a not for boolean data):

new_df = df[~df["col"].str.contains(word)]

Here new_df is the copy returned by the right-hand side (RHS).

contains also accepts a regular expression…


If the above throws a ValueError, the likely reason is that the column has mixed datatypes (for example, missing values), so use na=False:

new_df = df[~df["col"].str.contains(word, na=False)]

Or,

new_df = df[df["col"].str.contains(word) == False]
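
A minimal sketch of the negated filter on a column that has a missing value. The sample data and the names mask and new_df are made up for illustration. It passes na=False so that missing entries count as "does not contain", and regex=False so that word is matched literally rather than as a pattern.

```python
import pandas as pd

# Hypothetical sample data; the None entry stands in for a missing value.
df = pd.DataFrame({"col": ["apple pie", "banana", None, "pineapple"]})
word = "apple"

# na=False turns missing entries into False, so the mask is purely boolean.
mask = df["col"].str.contains(word, na=False, regex=False)

# ~ inverts the boolean mask: keep the rows that do NOT contain the word.
new_df = df[~mask]
print(new_df)
```

Rows with a missing value are kept here; passing na=True instead would make the inverted mask drop them.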

Answer 1

I was having trouble with the not (~) symbol as well, so here’s another way from another StackOverflow thread:

df[df["col"].str.contains('this|that')==False]

Answer 2

You can use apply with a lambda to select rows based on whether the column value appears in a list. Note that this is an exact match against each list item, not a substring test. For your scenario:

df[df["col"].apply(lambda x:x not in [word1,word2,word3])]
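
For exact matches like this, pandas also offers Series.isin, which avoids the Python-level lambda. A small sketch with invented data (the column values and the excluded list are assumptions), showing that both forms test whole values rather than substrings:

```python
import pandas as pd

df = pd.DataFrame({"col": ["red", "green", "dark red", "blue"]})
excluded = ["red", "blue"]  # hypothetical list of values to exclude

# Exact-match exclusion: "dark red" is kept because it is not equal to "red".
exact = df[~df["col"].isin(excluded)]

# Same result with apply and a lambda, as in the answer above.
via_apply = df[df["col"].apply(lambda x: x not in excluded)]

print(exact["col"].tolist())
print(via_apply["col"].tolist())
```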

Answer 3

I had to get rid of the NULL values before using the command recommended by Andy above. An example:

import pandas as pd

df = pd.DataFrame(index=[0, 1, 2], columns=['first', 'second', 'third'])
# .ix was removed in pandas 1.0; .loc is the label-based replacement
df.loc[:, 'first'] = 'myword'
df.loc[0, 'second'] = 'myword'
df.loc[2, 'second'] = 'myword'
df.loc[1, 'third'] = 'myword'
df

    first   second  third
0   myword  myword   NaN
1   myword  NaN      myword 
2   myword  myword   NaN

Now running the command:

~df["second"].str.contains(word)

I get the following error:

TypeError: bad operand type for unary ~: 'float'

I got rid of the NULL values using dropna() or fillna() first and retried the command with no problem.


Answer 4

The answers above already cover the basics.

I am adding a pattern for searching for multiple words and excluding the matching rows from the DataFrame.

Here 'word1', 'word2', 'word3', 'word4' = list of patterns to search

df = DataFrame

column_a = a column name from DataFrame df

Search_for_These_values = ['word1','word2','word3','word4'] 

pattern = '|'.join(Search_for_These_values)

result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
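
One caveat with joining on '|': the result is a regular expression, so a search term containing characters such as '.' or '+' is interpreted as regex syntax. A hedged sketch, using made-up data and search terms, that escapes each term with re.escape first and treats missing values as non-matches:

```python
import re

import pandas as pd

# Hypothetical data; the None entry stands in for a missing value.
df = pd.DataFrame({"column_a": ["C++ guide", "Python tips", None, "c+ notes"]})
search_terms = ["c++", "java"]  # "+" is a regex metacharacter

# Escape each term so it is matched literally, then join into one alternation.
pattern = "|".join(re.escape(term) for term in search_terms)

# case=False ignores case; na=False makes missing values count as "no match".
result = df.loc[~df["column_a"].str.contains(pattern, case=False, na=False)]
print(result)
```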

Answer 5

In addition to nanselm2’s answer, you can use 0 instead of False:

df["col"].str.contains(word)==0

Check if item is in an array / list

Question: Check if item is in an array / list

If I’ve got an array of strings, can I check to see if a string is in the array without doing a for loop? Specifically, I’m looking for a way to do it within an if statement, so something like this:

if [check that item is in array]:

Answer 0

Assuming you mean “list” where you say “array”, you can do

if item in my_list:
    # whatever

This works for any collection, not just for lists. For dictionaries, it checks whether the given key is present in the dictionary.
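
A short illustration of the point about dictionaries, with made-up data (my_list, prices and item are placeholder names): on a dict, in tests the keys, so checking for a value needs .values().

```python
my_list = ["apple", "banana", "cherry"]
prices = {"apple": 1.2, "banana": 0.5}  # hypothetical data

item = "banana"

if item in my_list:
    print(f"{item} is in the list")

# For a dict, `in` tests the keys, not the values.
print("apple" in prices)        # True: "apple" is a key
print(1.2 in prices)            # False: 1.2 is a value, not a key
print(1.2 in prices.values())   # True: explicit check against the values
```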


Answer 1

I’m also going to assume that you mean “list” when you say “array.” Sven Marnach’s solution is good. If you are going to be doing repeated checks on the list, then it might be worth converting it to a set or frozenset, which can be faster for each check. Assuming your list of strs is called subjects:

subject_set = frozenset(subjects)
if query in subject_set:
    # whatever
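
A sketch of the repeated-check case, assuming a small invented list of subjects and a few queries. The set is built once, and each later test is a hash lookup rather than a scan of the list; this requires the elements to be hashable, which strings are.

```python
subjects = ["math", "physics", "chemistry", "biology"]  # hypothetical list
queries = ["physics", "history", "biology"]

# Build the set once; each membership test afterwards is a hash lookup.
subject_set = frozenset(subjects)

found = [q for q in queries if q in subject_set]
missing = [q for q in queries if q not in subject_set]
print(found, missing)
```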

Answer 2

Use a lambda function.

Let’s say you have an array:

nums = [0,1,5]

Check whether 5 is in nums in Python 3.X:

(len(list(filter (lambda x : x == 5, nums))) > 0)

Check whether 5 is in nums in Python 2.7:

(len(filter (lambda x : x == 5, nums)) > 0)

This approach is more flexible: you can check whether any number satisfying a given condition is in your array nums.

For example, check whether any number that is greater than or equal to 5 exists in nums (wrap filter in list() on Python 3.X, as above):

(len(list(filter (lambda x : x >= 5, nums))) > 0)
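
As an alternative to filter plus len, the built-in any() with a generator expression expresses the same condition check and stops at the first match. This is a sketch of that swapped-in technique, not the method from the answer above:

```python
nums = [0, 1, 5]

# Plain membership test: the direct way to ask "is 5 in nums?"
has_five = 5 in nums

# any() with a generator stops at the first element that satisfies the condition.
has_at_least_five = any(x >= 5 for x in nums)
has_negative = any(x < 0 for x in nums)

print(has_five, has_at_least_five, has_negative)
```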

Answer 3

You have to use .values for arrays. For example, say you have a dataframe test with a column named 'Name', i.e. test['Name']; you can do

if name in test['Name'].values :
   print(name)

For a normal list you don't have to use .values.


Answer 4

You can also use the same syntax for an array. For example, searching within a Pandas series:

ser = pd.Series(['some', 'strings', 'to', 'query'])

if item in ser.values:
    # do stuff
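
The reason .values matters: applied directly to a Series, in tests the index labels rather than the stored values. A small sketch using the same series; the isin line is an extra alternative, not part of the answer above.

```python
import pandas as pd

ser = pd.Series(['some', 'strings', 'to', 'query'])

print('some' in ser)             # False: 'some' is not an index label
print(0 in ser)                  # True: 0 is an index label
print('some' in ser.values)      # True: checks the underlying values
print(ser.isin(['some']).any())  # True: vectorized alternative
```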

Is there a short contains function for lists?

Question: Is there a short contains function for lists?

I see people are using any to gather another list to see if an item exists in a list, but is there a quick way to just do something like this?

if list.contains(myItem):
    # do something

Answer 0

You can use this syntax:

if myItem in list:
    # do something

Also, inverse operator:

if myItem not in list:
    # do something

It works fine for lists, tuples, sets and dicts (where it checks keys).

Note that this is an O(n) operation in lists and tuples, but an O(1) operation in sets and dicts.


Answer 1

In addition to what others have said, you may also be interested to know that what in does is call the list.__contains__ method, which you can define on any class you write; this can be extremely handy for using Python to its full extent.

A dumb use may be:

>>> class ContainsEverything:
    def __init__(self):
        return None
    def __contains__(self, *elem, **k):
        return True


>>> a = ContainsEverything()
>>> 3 in a
True
>>> a in a
True
>>> False in a
True
>>> False not in a
False
>>>         
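
A slightly less silly sketch of the same idea: a hypothetical EvenNumbers class (the name and behaviour are invented for illustration) whose __contains__ decides membership by rule instead of by stored elements.

```python
class EvenNumbers:
    """Hypothetical container that 'contains' every even integer."""

    def __contains__(self, item):
        # Called for `item in obj`; the result is interpreted as a boolean.
        return isinstance(item, int) and item % 2 == 0


evens = EvenNumbers()
print(4 in evens)        # True
print(7 in evens)        # False
print("x" not in evens)  # True
```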

Answer 2

I came up with this one-liner recently for getting True if a list contains any number of occurrences of an item, or False if it contains no occurrences or is empty. Using next(...) gives this a default return value (False) and means it stops at the first match instead of building a whole list comprehension, so it should run significantly faster.

list_does_contain = next((True for item in list_to_test if item == test_item), False)


Answer 3

The list method index raises a ValueError if the item is not present, and returns the index of the item in the list if it is present. Alternatively, in an if statement you can do the following:

if myItem in list:
    #do things

You can also check if an element is not in a list with the following if statement:

if myItem not in list:
    #do things
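
A quick check of how list.index behaves for a missing item: it raises ValueError rather than returning -1 (str.find is the method that returns -1). The list contents are placeholders.

```python
items = ["a", "b", "c"]

# Position of the first match.
print(items.index("b"))

# A missing item raises ValueError instead of returning -1.
try:
    items.index("z")
    raised = False
except ValueError as exc:
    raised = True
    print(f"not found: {exc}")

# str.find, by contrast, returns -1 when the substring is missing.
print("abc".find("z"))
```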

Does Python have a string 'contains' substring method?

Question: Does Python have a string 'contains' substring method?

I’m looking for a string.contains or string.indexof method in Python.

I want to do:

if not somestring.contains("blah"):
   continue

Answer 0

You can use the in operator:

if "blah" not in somestring: 
    continue

Answer 1

If it’s just a substring search you can use string.find("substring").

You do have to be a little careful with find, index, and in though, as they are substring searches. In other words, this:

s = "This be a string"
if s.find("is") == -1:
    print("No 'is' here!")
else:
    print("Found 'is' in the string.")

It would print Found 'is' in the string. Similarly, if "is" in s: would evaluate to True. This may or may not be what you want.
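
If a whole-word match is what you want instead, one possible approach (an assumption about the intent, not something from the answer above) is a regular expression with word boundaries, or a test against the split words:

```python
import re

s = "This be a string"

# Substring test: "is" occurs inside "This".
substring_hit = "is" in s

# Whole-word test: \b marks a word boundary, so "This" does not count.
word_hit = re.search(r"\bis\b", s) is not None

# Alternative without regex: split on whitespace and test the word list.
split_hit = "is" in s.split()

print(substring_hit, word_hit, split_hit)
```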


Answer 2

Does Python have a string contains substring method?

Yes, but Python has a comparison operator that you should use instead, because the language intends its usage, and other programmers will expect you to use it. That keyword is in, which is used as a comparison operator:

>>> 'foo' in '**foo**'
True

The opposite (complement), which the original question asks for, is not in:

>>> 'foo' not in '**foo**' # returns False
False

This is semantically the same as not 'foo' in '**foo**' but it’s much more readable and explicitly provided for in the language as a readability improvement.

Avoid using __contains__, find, and index

As promised, here’s the contains method:

str.__contains__('**foo**', 'foo')

returns True. You could also call this function from the instance of the superstring:

'**foo**'.__contains__('foo')

But don’t. Methods that start with underscores are considered semantically private. The only reason to use this is when extending the in and not in functionality (e.g. if subclassing str):

class NoisyString(str):
    def __contains__(self, other):
        print('testing if "{0}" in "{1}"'.format(other, self))
        return super(NoisyString, self).__contains__(other)

ns = NoisyString('a string with a substring inside')

and now:

>>> 'substring' in ns
testing if "substring" in "a string with a substring inside"
True

Also, avoid the following string methods:

>>> '**foo**'.index('foo')
2
>>> '**foo**'.find('foo')
2

>>> '**oo**'.find('foo')
-1
>>> '**oo**'.index('foo')

Traceback (most recent call last):
  File "<pyshell#40>", line 1, in <module>
    '**oo**'.index('foo')
ValueError: substring not found

Other languages may have no methods to directly test for substrings, and so you would have to use these types of methods, but with Python, it is much more efficient to use the in comparison operator.

Performance comparisons

We can compare various ways of accomplishing the same goal.

import timeit

def in_(s, other):
    return other in s

def contains(s, other):
    return s.__contains__(other)

def find(s, other):
    return s.find(other) != -1

def index(s, other):
    try:
        s.index(other)
    except ValueError:
        return False
    else:
        return True



perf_dict = {
'in:True': min(timeit.repeat(lambda: in_('superstring', 'str'))),
'in:False': min(timeit.repeat(lambda: in_('superstring', 'not'))),
'__contains__:True': min(timeit.repeat(lambda: contains('superstring', 'str'))),
'__contains__:False': min(timeit.repeat(lambda: contains('superstring', 'not'))),
'find:True': min(timeit.repeat(lambda: find('superstring', 'str'))),
'find:False': min(timeit.repeat(lambda: find('superstring', 'not'))),
'index:True': min(timeit.repeat(lambda: index('superstring', 'str'))),
'index:False': min(timeit.repeat(lambda: index('superstring', 'not'))),
}

And now we see that using in is much faster than the others. Less time to do an equivalent operation is better:

>>> perf_dict
{'in:True': 0.16450627865128808,
 'in:False': 0.1609668098178645,
 '__contains__:True': 0.24355481654697542,
 '__contains__:False': 0.24382793854783813,
 'find:True': 0.3067379407923454,
 'find:False': 0.29860888058124146,
 'index:True': 0.29647137792585454,
 'index:False': 0.5502287584545229}

Answer 3

if needle in haystack: is the normal use, as @Michael says — it relies on the in operator, more readable and faster than a method call.

If you truly need a method instead of an operator (e.g. to do some weird key= for a very peculiar sort…?), that would be 'haystack'.__contains__. But since your example is for use in an if, I guess you don’t really mean what you say;-). It’s not good form (nor readable, nor efficient) to use special methods directly — they’re meant to be used, instead, through the operators and builtins that delegate to them.


Answer 4

in Python strings and lists

Here are a few useful examples that speak for themselves concerning the in operator:

"foo" in "foobar"
True

"foo" in "Foobar"
False

"foo" in "Foobar".lower()
True

"foo".capitalize() in "Foobar"
True

"foo" in ["bar", "foo", "foobar"]
True

"foo" in ["fo", "o", "foobar"]
False

["foo" in a for a in ["fo", "o", "foobar"]]
[False, False, True]

Caveat: lists are iterables, and the in operator acts on iterables, not just strings.


Answer 5

If you are happy with "blah" in somestring but want it to be a function/method call, you can probably do this

import operator

if not operator.contains(somestring, "blah"):
    continue

All operators in Python can be more or less found in the operator module including in.
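
A function form is mostly useful where a callable is required. A small sketch with invented input lines; note the argument order, since operator.contains(a, b) means b in a.

```python
import operator
from itertools import repeat

somestring = "blah blah"

# operator.contains(a, b) is the function form of `b in a`.
direct = operator.contains(somestring, "blah")

# Passing the function to map(): test every line for the same needle.
lines = ["no match here", "contains blah", "blah again"]  # hypothetical input
flags = list(map(operator.contains, lines, repeat("blah")))

print(direct, flags)
```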


Answer 6

So apparently there is nothing similar for vector-wise comparison. An obvious Python way to do so would be:

names = ['bob', 'john', 'mike']
any(st in 'bob and john' for st in names) 
>> True

any(st in 'mary and jane' for st in names) 
>> False

Answer 7

You can use y.count().

It will return the integer value of the number of times a sub string appears in a string.

For example:

string.count("bah") >> 0
string.count("Hello") >> 1
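
A runnable version of the idea with a made-up string. Two details worth knowing: count() counts non-overlapping occurrences, and for a plain yes/no answer the in operator says the same thing as count() > 0.

```python
string = "Hello world, Hello again"  # hypothetical sample text

print(string.count("bah"))    # 0
print(string.count("Hello"))  # 2

# count() counts non-overlapping occurrences.
print("aaaa".count("aa"))     # 2

# For a yes/no answer, `in` agrees with count() > 0.
print(("Hello" in string) == (string.count("Hello") > 0))  # True
```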

Answer 8

Here is your answer:

if "insert_char_or_string_here" in "insert_string_to_search_here":
    #DOSTUFF

For checking if it is false:

if not "insert_char_or_string_here" in "insert_string_to_search_here":
    #DOSTUFF

OR:

if "insert_char_or_string_here" not in "insert_string_to_search_here":
    #DOSTUFF

Answer 9

You can use regular expressions to get the occurrences:

>>> import re
>>> print(re.findall(r'( |t)', to_search_in)) # searches for t or space
['t', ' ', 't', ' ', ' ']