问题:如何在大熊猫中测试字符串是否包含列表中的子字符串之一?
有没有这将是一个组合的等同的任何功能df.isin()
和df[col].str.contains()
?
例如,假设我有系列
s = pd.Series(['cat','hat','dog','fog','pet'])
,并且我想找到s
包含的任何一个的所有地方['og', 'at']
,那么我想得到除“宠物”以外的所有东西。
我有一个解决方案,但这很不雅致:
searchfor = ['og', 'at']
found = [s.str.contains(x) for x in searchfor]
result = pd.DataFrame[found]
result.any()
有一个更好的方法吗?
Is there any function that would be the equivalent of a combination of df.isin()
and df[col].str.contains()
?
For example, say I have the series
s = pd.Series(['cat','hat','dog','fog','pet'])
, and I want to find all places where s
contains any of ['og', 'at']
, I would want to get everything but ‘pet’.
I have a solution, but it’s rather inelegant:
searchfor = ['og', 'at']
found = [s.str.contains(x) for x in searchfor]
result = pd.DataFrame[found]
result.any()
Is there a better way to do this?
回答 0
一种选择是仅使用正则表达式|
字符尝试匹配系列中单词中的每个子字符串s
(仍使用str.contains
)。
您可以通过将单词searchfor
与结合在一起来构造正则表达式|
:
>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0 cat
1 hat
2 dog
3 fog
dtype: object
就像@AndyHayden在下面的注释中指出的那样,请注意您的子字符串是否具有特殊字符,例如$
和^
您想在字面上进行匹配。这些字符在正则表达式的上下文中具有特定含义,并且会影响匹配。
您可以通过转义非字母数字字符来使子字符串列表更安全re.escape
:
>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']
与结合使用时,此新列表中带有的字符串将逐字匹配每个字符str.contains
。
One option is just to use the regex |
character to try to match each of the substrings in the words in your Series s
(still using str.contains
).
You can construct the regex by joining the words in searchfor
with |
:
>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0 cat
1 hat
2 dog
3 fog
dtype: object
As @AndyHayden noted in the comments below, take care if your substrings have special characters such as $
and ^
which you want to match literally. These characters have specific meanings in the context of regular expressions and will affect the matching.
You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape
:
>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']
The strings with in this new list will match each character literally when used with str.contains
.
回答 1
您可以使用str.contains
regex模式单独使用OR (|)
:
s[s.str.contains('og|at')]
或者您可以将系列添加到,dataframe
然后使用str.contains
:
df = pd.DataFrame(s)
df[s.str.contains('og|at')]
输出:
0 cat
1 hat
2 dog
3 fog
You can use str.contains
alone with a regex pattern using OR (|)
:
s[s.str.contains('og|at')]
Or you could add the series to a dataframe
then use str.contains
:
df = pd.DataFrame(s)
df[s.str.contains('og|at')]
Output:
0 cat
1 hat
2 dog
3 fog
回答 2
这是一个单行lambda,它也可以工作:
df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)
输入:
searchfor = ['og', 'at']
df = pd.DataFrame([('cat', 1000.0), ('hat', 2000000.0), ('dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])
col1 col2
0 cat 1000.0
1 hat 2000000.0
2 dog 1000.0
3 fog 330000.0
4 pet 330000.0
应用Lambda:
df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)
输出:
col1 col2 TrueFalse
0 cat 1000.0 1
1 hat 2000000.0 1
2 dog 1000.0 1
3 fog 330000.0 1
4 pet 330000.0 0
Here is a one line lambda that also works:
df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)
Input:
searchfor = ['og', 'at']
df = pd.DataFrame([('cat', 1000.0), ('hat', 2000000.0), ('dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])
col1 col2
0 cat 1000.0
1 hat 2000000.0
2 dog 1000.0
3 fog 330000.0
4 pet 330000.0
Apply Lambda:
df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)
Output:
col1 col2 TrueFalse
0 cat 1000.0 1
1 hat 2000000.0 1
2 dog 1000.0 1
3 fog 330000.0 1
4 pet 330000.0 0