Tag Archives: regex

Extract part of a regular expression match

Question: Extract part of a regular expression match


I want a regular expression to extract the title from an HTML page. Currently I have this:

title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '') 

Is there a regular expression to extract just the contents of <title> so I don’t have to remove the tags?


Answer 0


Use ( ) in the regex and group(1) in Python to retrieve the captured string (re.search will return None if it doesn’t find the result, so don’t use group() directly):

title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

if title_search:
    title = title_search.group(1)

Answer 1


Note that starting with Python 3.8 and the introduction of assignment expressions (PEP 572, the := operator), it’s possible to improve a bit on Krzysztof Krasoń’s solution by capturing the match result directly within the if condition as a variable and re-using it in the condition’s body:

# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
    title = match.group(1)
# hello

Answer 2


Try using capturing groups:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

Answer 3

re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)

Answer 4


May I recommend Beautiful Soup? It’s a very good library for parsing your whole HTML document.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
title = soup.title.string  # the text inside <title>; .name would give the tag name, 'title'

Answer 5


Try:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

Answer 6


The provided pieces of code do not cope with exceptions. May I suggest:

getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]

This returns the first match, or an empty string by default if the pattern is not found.
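A quick demonstration of both branches, using hypothetical strings:

>>> import re
>>> s = "<title>Hello</title>"
>>> getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]
'Hello'
>>> s = "no title here"
>>> getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]
''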


Answer 7


I’d think this should suffice:

import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)

… assuming that your text (HTML) is in a variable named “text.”

This also assumes that no other HTML tags can be legally embedded inside an HTML TITLE tag and that there is no way to legally embed any other < character within such a container/block.

However

Don’t use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you’re going to write a full parser, which would be a lot of extra work when various HTML, SGML and XML parsers are already in the standard libraries.)

If you’re handling “real world” tag soup HTML (which is frequently non-conforming to any SGML/XML validator), use the BeautifulSoup package. It isn’t in the standard libraries (yet) but is widely recommended for this purpose.

Another option is lxml, which is written for properly structured (standards-conformant) HTML. But it has an option to fall back to using BeautifulSoup as a parser: ElementSoup.
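To make the lxml option concrete, here is a minimal sketch (assuming lxml is installed; the sample markup is made up for illustration):

from lxml import html

# lxml's HTML parser tolerates messy real-world markup reasonably well
doc = html.fromstring('<html><head><title>Hello</title></head><body></body></html>')
title = doc.findtext('.//title')  # 'Hello'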


Python extract pattern matches

Question: Python extract pattern matches


Python 2.7.1. I am trying to use a Python regular expression to extract words inside of a pattern.

I have some string that looks like this

someline abc
someother line
name my_user_name is valid
some more lines

I want to extract the word “my_user_name”. I do something like

import re
s = #that big string
p = re.compile("name .* is valid", re.flags)
p.match(s) #this gives me <_sre.SRE_Match object at 0x026B6838>

How do I extract my_user_name now?


Answer 0


You need to capture from the regex: search for the pattern and, if found, retrieve the string using group(index). Assuming valid checks are performed:

>>> p = re.compile("name (.*) is valid")
>>> result = p.search(s)
>>> result
<_sre.SRE_Match object at 0x10555e738>
>>> result.group(1)     # group(1) will return the 1st capture (the text inside the parentheses).
                        # group(0) will return the entire matched text.
'my_user_name'

Answer 1


You can use matching groups:

p = re.compile('name (.*) is valid')

e.g.

>>> import re
>>> p = re.compile('name (.*) is valid')
>>> s = """
... someline abc
... someother line
... name my_user_name is valid
... some more lines"""
>>> p.findall(s)
['my_user_name']

Here I use re.findall rather than re.search to get all instances of my_user_name. Using re.search, you’d need to get the data from the group on the match object:

>>> p.search(s)   #gives a match object or None if no match is found
<_sre.SRE_Match object at 0xf5c60>
>>> p.search(s).group() #entire string that matched
'name my_user_name is valid'
>>> p.search(s).group(1) #first captured group in the string that matched
'my_user_name'

As mentioned in the comments, you might want to make your regex non-greedy:

p = re.compile('name (.*?) is valid')

to only pick up the stuff between 'name ' and the next ' is valid' (rather than allowing your regex to pick up other ' is valid' strings into your group).
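To see the difference, take a hypothetical string containing two ' is valid' phrases:

>>> import re
>>> s2 = "name a is valid and name b is valid"
>>> re.search('name (.*) is valid', s2).group(1)   # greedy: extends to the last ' is valid'
'a is valid and name b'
>>> re.search('name (.*?) is valid', s2).group(1)  # non-greedy: stops at the first
'a'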


Answer 2


You could use something like this:

import re

# The question's sample text:
s = ("someline abc\n"
     "someother line\n"
     "name my_user_name is valid\n"
     "some more lines")

# the parentheses create a group with what was matched
# and '\w' matches only alphanumeric characters
p = re.compile(r"name +(\w+) +is valid")
# use search(), so the match doesn't have to happen
# at the beginning of the big string
m = p.search(s)
# search() returns a Match object with information about what was matched
if m:
    name = m.group(1)
else:
    raise Exception('name not found')

Answer 3


Maybe that’s a bit shorter and easier to understand:

>>> import re
>>> text = '... someline abc... someother line... name my_user_name is valid.. some more lines'
>>> re.search('name (.*) is valid', text).group(1)
'my_user_name'

Answer 4


You want a capture group.

p = re.compile("name (.*) is valid")  # parentheses for capture groups
# use search() rather than match(): match() anchors at the very start of the string
print(p.search(s).groups())  # this gives you a tuple of your captures

Answer 5


You can use groups (indicated with '(' and ')') to capture parts of the string. The match object’s group() method then gives you the group’s contents:

>>> import re
>>> s = 'name my_user_name is valid'
>>> match = re.search('name (.*) is valid', s)
>>> match.group(0)  # the entire match
'name my_user_name is valid'
>>> match.group(1)  # the first parenthesized subgroup
'my_user_name'

In Python 3.6+ you can also index into a match object instead of using group():

>>> match[0]  # the entire match 
'name my_user_name is valid'
>>> match[1]  # the first parenthesized subgroup
'my_user_name'

Answer 6


Here’s a way to do it without using groups (Python 3.6 or above):

>>> re.search(r'2\d\d\d[01]\d[0-3]\d', 'report_20191207.xml')[0]
'20191207'

Answer 7


You can also use a named capture group (?P<user>pattern) and access the group like a dictionary: match['user'].

import re

string = '''someline abc\n
            someother line\n
            name my_user_name is valid\n
            some more lines\n'''

pattern = r'name (?P<user>.*) is valid'
matches = re.search(pattern, string, re.DOTALL)
print(matches['user'])

# my_user_name

Answer 8


It seems like you’re actually trying to extract a name rather than simply find a match. If this is the case, having span indexes for your match is helpful, and I’d recommend using re.finditer. As a shortcut, you know the name part of your regex is length 5 and the is valid part is length 9, so you can slice the matching text to extract the name.

Note – in your example, it looks like s is a string with line breaks, so that’s what’s assumed below.

import re

## convert s to a list of strings, one per line:
s2 = s.splitlines()

## find matches by line:
for i, j in enumerate(s2):
    matches = re.finditer("name (.*) is valid", j)
    ## finditer returns an iterator; lines without a match simply yield nothing
    for k in matches:
        ## get text
        match_txt = k.group(0)
        ## get line span
        match_span = k.span(0)
        ## extract username by slicing off 'name ' (5 chars) and ' is valid' (9 chars)
        my_user_name = match_txt[5:-9]
        ## compare with original text
        print(f'Extracted Username: {my_user_name} - found on line {i}')
        print('Match Text:', match_txt)

Speed up millions of regex replacements in Python 3

Question: Speed up millions of regex replacements in Python 3


I’m using Python 3.5.2

I have two lists

  • a list of about 750,000 “sentences” (long strings)
  • a list of about 20,000 “words” that I would like to delete from my 750,000 sentences

So, I have to loop through 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually “words” and are not part of a larger string of characters.

I am doing this by pre-compiling my words so that they are flanked by the \b metacharacter

compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

Then I loop through my “sentences”

import re

for sentence in sentences:
  for word in compiled_words:
    sentence = re.sub(word, "", sentence)
  # put sentence into a growing list

This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

  • Is there a way to use the str.replace method (which I believe is faster), but still require that replacements only happen at word boundaries?

  • Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping re.sub when the length of my word is greater than the length of my sentence, but it’s not much of an improvement.

Thank you for any suggestions.


Answer 0


One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".

Because re relies on C code to do the actual matching, the savings can be dramatic.

As @pvg pointed out in the comments, it also benefits from single pass matching.

If your words are not regexes, Eric’s answer is faster.
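As a minimal sketch of that idea (the word list here is hypothetical; re.escape guards against words containing regex metacharacters):

import re

banned_words = ["word1", "word2", "word3"]
# one compiled alternation instead of 20,000 separately compiled patterns
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, banned_words)) + r")\b")

sentences = ["word1 appears here", "nothing to remove here"]
cleaned = [pattern.sub("", s) for s in sentences]  # a single pass per sentence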


Answer 1


TLDR

Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP’s, it’s approximately 2000 times faster than the accepted answer.

If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.

Theory

If your sentences aren’t humongous strings, it’s probably feasible to process many more than 50 per second.

If you save all the banned words into a set, it will be very fast to check if another word is included in that set.

Pack the logic into a function, give this function as argument to re.sub and you’re done!

Code

import re
with open('/usr/share/dict/american-english') as wordbook:
    banned_words = set(word.strip().lower() for word in wordbook)


def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
             "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

word_pattern = re.compile(r'\w+')

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)

Converted sentences are:

' .  !
  .
GiraffeElephantBoat
sfgsdg sdwerha aswertwe

Note that:

  • the search is case-insensitive (thanks to lower())
  • replacing a word with "" might leave two spaces (as in your code)
  • With python3, \w+ also matches accented characters (e.g. "ångström").
  • Any non-word character (tab, space, newline, marks, …) will stay untouched.

Performance

There are a million sentences, banned_words has almost 100000 words and the script runs in less than 7s.

In comparison, Liteye’s answer needed 160s for 10 thousand sentences.

With n being the total amount of words and m the amount of banned words, OP’s and Liteye’s code are O(n*m).

In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).

Regex union test

What’s the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?

It’s pretty hard to grasp the way the regex engine works, so let’s write a simple test.

This code extracts 10**i random English words into a list. It creates the corresponding regex union, and tests it with different words:

  • one is clearly not a word (it begins with #)
  • one is the first word in the list
  • one is the last word in the list
  • one looks like a word but isn’t


import re
import timeit
import random

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook]
    random.shuffle(english_words)

print("First 10 words :")
print(english_words[:10])

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", english_words[0]),
    ("Last word", english_words[-1]),
    ("Almost a word", "couldbeaword")
]


def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nUnion of %d words" % 10**exp)
    union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %-17s : %.1fms" % (description, time))

It outputs:

First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']

Union of 10 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 0.7ms
  Almost a word     : 0.7ms

Union of 100 words
  Surely not a word : 0.7ms
  First word        : 1.1ms
  Last word         : 1.2ms
  Almost a word     : 1.2ms

Union of 1000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 9.6ms
  Almost a word     : 10.1ms

Union of 10000 words
  Surely not a word : 1.4ms
  First word        : 1.8ms
  Last word         : 96.3ms
  Almost a word     : 116.6ms

Union of 100000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 1227.1ms
  Almost a word     : 1404.1ms

So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b' pattern has:

  • O(1) best case
  • O(n/2) average case, which is still O(n)
  • O(n) worst case

These results are consistent with a simple loop search.

A much faster alternative to a regex union is to create the regex pattern from a trie.


Answer 2


TLDR

Use this method if you want the fastest regex-based solution. For a dataset similar to the OP’s, it’s approximately 1000 times faster than the accepted answer.

If you don’t care about regex, use this set-based version, which is 2000 times faster than a regex union.

Optimized Regex with Trie

A simple Regex union approach becomes slow with many banned words, because the regex engine doesn’t do a very good job of optimizing the pattern.

It’s possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex aren’t really human-readable, but they do allow for very fast lookup and match.

Example

['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']

The list is converted to a trie:

{
    'f': {
        'o': {
            'o': {
                'x': {
                    'a': {
                        'r': {
                            '': 1
                        }
                    }
                },
                'b': {
                    'a': {
                        'r': {
                            '': 1
                        },
                        'h': {
                            '': 1
                        }
                    }
                },
                'z': {
                    'a': {
                        '': 1,
                        'p': {
                            '': 1
                        }
                    }
                }
            }
        }
    }
}

And then to this regex pattern:

r"\bfoo(?:ba[hr]|xar|zap?)\b"

The huge advantage is that to test whether zoo matches, the regex engine only needs to compare the first character (it doesn’t match), instead of trying all 5 words. The preprocessing is overkill for 5 words, but it shows promising results for many thousands of words.

Note that (?:) non-capturing groups are used because:

  • foobar|baz would match foobarbaz but not foobaz
  • foo(bar|baz) would save unneeded info to a capture group

Code

Here’s a slightly modified gist, which we can use as a trie.py library:

import re


class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

Test

Here’s a small test (the same as this one):

# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
    random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))

It outputs:

TrieRegex of 10 words
  Surely not a word : 0.3ms
  First word : 0.4ms
  Last word : 0.5ms
  Almost a word : 0.5ms

TrieRegex of 100 words
  Surely not a word : 0.3ms
  First word : 0.5ms
  Last word : 0.9ms
  Almost a word : 0.6ms

TrieRegex of 1000 words
  Surely not a word : 0.3ms
  First word : 0.7ms
  Last word : 0.9ms
  Almost a word : 1.1ms

TrieRegex of 10000 words
  Surely not a word : 0.1ms
  First word : 1.0ms
  Last word : 1.2ms
  Almost a word : 1.2ms

TrieRegex of 100000 words
  Surely not a word : 0.3ms
  First word : 1.2ms
  Last word : 0.9ms
  Almost a word : 1.6ms

For info, the regex begins like this:

(?:a(?:(?:\'s|a(?:\'s|chen|liyah(?:\'s)?|r(?:dvark(?:(?:\'s|s))?|on))|b(?:\'s|a(?:c(?:us(?:(?:\'s|es))?|[ik])|ft|lone(?:(?:\'s|s))?|ndon(?:(?:ed|ing|ment(?:\'s)?|s))?|s(?:e(?:(?:ment(?:\'s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\'s)?|[ds]))?|ing|toir(?:(?:\'s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:\'s|es))?|y(?:(?:\'s|s))?)|ot(?:(?:\'s|t(?:\'s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|y(?:\'s)?|\é(?:(?:\'s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|om(?:en(?:(?:\'s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:\'s|s))?)|or(?:(?:\'s|s))?|s))?|l(?:\'s)?))|e(?:(?:\'s|am|l(?:(?:\'s|ard|son(?:\'s)?))?|r(?:deen(?:\'s)?|nathy(?:\'s)?|ra(?:nt|tion(?:(?:\'s|s))?))|t(?:(?:t(?:e(?:r(?:(?:\'s|s))?|d)|ing|or(?:(?:\'s|s))?)|s))?|yance(?:\'s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:\'s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:\'s)?)|gail|l(?:ene|it(?:ies|y(?:\'s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:\'s|s))?|e[ds]?|ing))|l(?:a(?:tive(?:(?:\'s|s))?|ze)|e(?:(?:st|r))?|oom|ution(?:(?:\'s|s))?|y)|m\'s|n(?:e(?:gat(?:e[ds]?|i(?:ng|on(?:\'s)?))|r(?:\'s)?)|ormal(?:(?:it(?:ies|y(?:\'s)?)|ly))?)|o(?:ard|de(?:(?:\'s|s))?|li(?:sh(?:(?:e[ds]|ing))?|tion(?:(?:\'s|ist(?:(?:\'s|s))?))?)|mina(?:bl[ey]|t(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|r(?:igin(?:al(?:(?:\'s|s))?|e(?:(?:\'s|s))?)|t(?:(?:ed|i(?:ng|on(?:(?:\'s|ist(?:(?:\'s|s))?|s))?|ve)|s))?)|u(?:nd(?:(?:ed|ing|s))?|t)|ve(?:(?:\'s|board))?)|r(?:a(?:cadabra(?:\'s)?|d(?:e[ds]?|ing)|ham(?:\'s)?|m(?:(?:\'s|s))?|si(?:on(?:(?:\'s|s))?|ve(?:(?:\'s|ly|ness(?:\'s)?|s))?))|east|idg(?:e(?:(?:ment(?:(?:\'s|s))?|[ds]))?|ing|ment(?:(?:\'s|s))?)|o(?:ad|gat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|upt(?:(?:e(?:st|r)|ly|ness(?:\'s)?))?)|s(?:alom|c(?:ess(?:(?:\'s|e[ds]|ing))?|issa(?:(?:\'s|[es]))?|ond(?:(?:ed|ing|s))?)|en(?:ce(?:(?:\'s|s))?|t(?:(?:e(?:e(?:(?:\'s|ism(?:\'s)?|s))?|d)|ing|ly|s))?)|inth(?:(?:\'s|e(?:\'s)?))?|o(?:l(?:ut(?:e(?:(?:\'s|ly|st?))?|i(?:on(?:\'s)?|sm(?:\'s)?))|v(?:e[ds]?|ing))|r(?:b(?:(?:e(?:n(?:cy(?:\'s)?|t(?:(?:\'s|s))?)|d)|ing|s))?|pti…

It’s really unreadable, but for a list of 100000 banned words, this Trie regex is 1000 times faster than a simple regex union!

Here’s a diagram of the complete trie, exported with trie-python-graphviz and graphviz twopi.


Answer 3


One thing you might want to try is pre-processing the sentences to encode the word boundaries. Basically turn each sentence into a list of words by splitting on word boundaries.

This should be faster, because to process a sentence, you just have to step through each of the words and check if it’s a match.

Currently the regex search is having to go through the entire string again each time, looking for word boundaries, and then “discarding” the result of this work before the next pass.
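A rough sketch of this pre-splitting idea (helper names are hypothetical; the capturing group in the split pattern keeps the separators so each sentence can be rebuilt):

import re

banned_words = {"foo", "bar"}
splitter = re.compile(r"(\W+)")  # capturing group: separators are kept in the result

def clean(sentence):
    parts = splitter.split(sentence)
    # separators contain no word characters, so they never match the set
    return "".join("" if p.lower() in banned_words else p for p in parts)

print(clean("foo went to the bar"))  # ' went to the '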


Answer 4


Well, here’s a quick and easy solution, with a test set.

Winning strategy:

re.sub(r"\w+", repl, sentence) searches for words.

repl can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search and replace.

This is the simplest and fastest solution (see function replace4 in example code below).

Second best

The idea is to split the sentences into words, using re.split, while conserving the separators to reconstruct the sentences later. Then, replacements are done with a simple dict lookup.

(see function replace3 in example code below).

Timings for example functions:

replace1: 0.62 sentences/s
replace2: 7.43 sentences/s
replace3: 48498.03 sentences/s
replace4: 61374.97 sentences/s (...and 240,000/s with PyPy)

…and code:

#! /bin/env python3
# -*- coding: utf-8

import time, random, re

def replace1( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns:
            sentence = re.sub( "\\b"+search+"\\b", repl, sentence )

def replace2( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns_comp:
            sentence = re.sub( search, repl, sentence )

def replace3( sentences ):
    pd = patterns_dict.get
    for n, sentence in enumerate( sentences ):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)

        # and... done.
        sentence = "".join( pd(w,w) for w in words )

        #~ print( n, sentence )

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w,w)

    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)



# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]


def test( func, num ):
    t = time.time()
    func( test_sentences[:num] )
    print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))

print( "Sentences", len(test_sentences) )
print( "Words    ", len(test_words) )

test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )

Edit: You can also ignore case when checking, if you pass a lowercase list of sentences and edit repl:

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w.lower(), w)

Answer 5


Perhaps Python is not the right tool here. Here is one approach with the Unix toolchain:

sed G file         |
tr ' ' '\n'        |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'

assuming your blacklist file is preprocessed with the word boundaries added. The steps are: convert the file to double-spaced, split each sentence into one word per line, mass-delete the blacklist words from the file, and merge back the lines.

This should run at least an order of magnitude faster.

For preprocessing the blacklist file from words (one word per line)

sed 's/.*/\\b&\\b/' words > blacklist

Answer 6


How about this:

#!/usr/bin/env python3

from __future__ import unicode_literals, print_function
import re
import time
import io

def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            yield next(words_iter) # yield the word separator

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        while True:
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            yield sentence[last_word_boundary:current_boundary] # yield the separators
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            word = sentence[last_word_boundary:current_boundary]
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b' .')
print('time:', time.time() - start)

These solutions split on word boundaries and look up each word in a set. They should be faster than re.sub with word alternations (Liteye’s solution), as these solutions are O(n), where n is the size of the input, thanks to the amortized O(1) set lookup; using regex alternations would cause the regex engine to have to check for word matches at every character rather than just at word boundaries. My solutions take extra care to preserve the whitespace that was used in the original text (i.e. they don’t compress whitespace and they preserve tabs, newlines, and other whitespace characters), but if you decide you don’t care about that, it should be fairly straightforward to strip them from the output.

I tested on corpus.txt, which is a concatenation of multiple eBooks downloaded from the Gutenberg Project, and banned_words.txt is 20000 words randomly picked from Ubuntu’s wordlist (/usr/share/dict/american-english). It takes around 30 seconds to process 862462 sentences (and half of that on PyPy). I’ve defined sentences as anything separated by “. “.

$ # replace_sentences_1()
$ python3 filter_words.py 
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py 
number of sentences: 862462
time: 15.9370770454

$ # replace_sentences_2()
$ python3 filter_words.py 
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py 
number of sentences: 862462
time: 13.1190629005

PyPy particularly benefits from the second approach, while CPython fared better on the first. The above code should work on both Python 2 and 3.


Answer 7


Practical approach

The solution described below uses a lot of memory: it stores all the text in one string to reduce the complexity level. If RAM is an issue, think twice before using it.

With join/split tricks you can avoid explicit loops entirely, which should speed up the algorithm.

  • Concatenate the sentences with a special delimiter that does not occur inside any sentence:
  • merged_sentences = ' * '.join(sentences)

  • Compile a single regex for all the words you need to remove from the sentences, using the | “or” regex statement:
  • regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag

  • Substitute the words with the compiled regex, then split the result on the special delimiter back into separate sentences (a combined sketch follows this list):
  • clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')

Performance

The complexity of "".join is O(n). This is pretty intuitive, but here is a shortened quotation from the CPython source anyway:

for (i = 0; i < seqlen; i++) {
    [...]
    sz += PyUnicode_GET_LENGTH(item);

Therefore with join/split you have O(words) + 2*O(sentences), which is still linear complexity, vs. 2*O(N²) with the initial approach.

By the way, don’t use multithreading. The GIL will block each operation because your task is strictly CPU-bound, so the GIL has no chance to be released; each thread would just contend for it, adding extra work and potentially stalling the operation indefinitely.
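Putting the three steps together, a compact end-to-end sketch (assuming the delimiter ' * ' never occurs inside a sentence):

import re

sentences = ["remove foo here", "keep this one"]
words = ["foo", "bar"]

merged_sentences = ' * '.join(sentences)
regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I)
clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')
print(clean_sentences)  # ['remove  here', 'keep this one']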


Answer 8


Concatenate all your sentences into one document. Use any implementation of the Aho-Corasick algorithm (here’s one) to locate all your “bad” words. Traverse the file, replacing each bad word and updating the offsets of the found words that follow, etc.
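A sketch of that approach, assuming the third-party pyahocorasick package (pip install pyahocorasick). Note that Aho-Corasick matches raw substrings, so a word-boundary check would still be needed on top of this:

import ahocorasick

automaton = ahocorasick.Automaton()
for bad_word in ["foo", "bar"]:
    automaton.add_word(bad_word, bad_word)
automaton.make_automaton()

text = "foo went to the bar"
# iter() yields (end_index, value) for every occurrence, found in a single pass
for end_index, bad_word in automaton.iter(text):
    start_index = end_index - len(bad_word) + 1
    print(bad_word, start_index, end_index)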


Escape regex special characters in a Python string

Question: Escape regex special characters in a Python string


Does Python have a function that I can use to escape special characters in a regular expression?

For example, I'm "stuck" :\ should become I\'m \"stuck\" :\\.


Answer 0


Use re.escape:

>>> import re
>>> re.escape(r'\ a.*$')
'\\\\\\ a\\.\\*\\$'
>>> print(re.escape(r'\ a.*$'))
\\\ a\.\*\$
>>> re.escape('www.stackoverflow.com')
'www\\.stackoverflow\\.com'
>>> print(re.escape('www.stackoverflow.com'))
www\.stackoverflow\.com

Repeating it here:

re.escape(string)

Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

As of Python 3.7, re.escape() was changed to escape only characters which are meaningful to regex operations.
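A small illustration of that change (a hedged sketch; the output shown is from a Python 3.7+ interpreter, where on 3.6 and earlier the ! would be escaped too):

>>> import re
>>> re.escape('stuck?!')
'stuck\\?!'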


Answer 1


    I’m surprised no one has mentioned using regular expressions via re.sub():

    import re
    print re.sub(r'([\"])',    r'\\\1', 'it\'s "this"')  # it's \"this\"
    print re.sub(r"([\'])",    r'\\\1', 'it\'s "this"')  # it\'s "this"
    print re.sub(r'([\" \'])', r'\\\1', 'it\'s "this"')  # it\'s\ \"this\"
    

    Important things to note:

    • In the search pattern, include \ as well as the character(s) you’re looking for. You’re going to be using \ to escape your characters, so you need to escape that as well.
    • Put parentheses around the search pattern, e.g. ([\"]), so that the substitution pattern can use the found character when it adds \ in front of it. (That’s what \1 does: uses the value of the first parenthesized group.)
    • The r in front of r'([\"])' means it’s a raw string. Raw strings use different rules for escaping backslashes. To write ([\"]) as a plain string, you’d need to double all the backslashes and write '([\\"])'. Raw strings are friendlier when you’re writing regular expressions.
    • In the substitution pattern, you need to escape \ to distinguish it from a backslash that precedes a substitution group, e.g. \1, hence r'\\\1'. To write that as a plain string, you’d need '\\\\\\1' — and nobody wants that.

    回答 2

    使用repr()[1:-1]。在这种情况下,双引号不需要转义。[1:-1] 切片用于删除开头和结尾的单引号。

    >>> x = raw_input()
    I'm "stuck" :\
    >>> print x
    I'm "stuck" :\
    >>> print repr(x)[1:-1]
    I\'m "stuck" :\\

    或者,也许您只是想转义一个短语以粘贴到您的程序中?如果是这样,请执行以下操作:

    >>> raw_input()
    I'm "stuck" :\
    'I\'m "stuck" :\\'

    Use repr()[1:-1]. In this case, the double quotes don’t need to be escaped. The [1:-1] slice removes the single quote from the beginning and the end.

    >>> x = raw_input()
    I'm "stuck" :\
    >>> print x
    I'm "stuck" :\
    >>> print repr(x)[1:-1]
    I\'m "stuck" :\\
    

    Or maybe you just want to escape a phrase to paste into your program? If so, do this:

    >>> raw_input()
    I'm "stuck" :\
    'I\'m "stuck" :\\'
    

    回答 3

    如上所述,答案取决于您的情况。如果要转义正则表达式的字符串,则应使用re.escape()。但是,如果要转义一组特定的字符,请使用此lambda函数:

    >>> escape = lambda s, escapechar, specialchars: "".join(escapechar + c if c in specialchars or c == escapechar else c for c in s)
    >>> s = raw_input()
    I'm "stuck" :\
    >>> print s
    I'm "stuck" :\
    >>> print escape(s, "\\", ['"'])
    I'm \"stuck\" :\\

    As it was mentioned above, the answer depends on your case. If you want to escape a string for a regular expression then you should use re.escape(). But if you want to escape a specific set of characters then use this lambda function:

    >>> escape = lambda s, escapechar, specialchars: "".join(escapechar + c if c in specialchars or c == escapechar else c for c in s)
    >>> s = raw_input()
    I'm "stuck" :\
    >>> print s
    I'm "stuck" :\
    >>> print escape(s, "\\", ['"'])
    I'm \"stuck\" :\\
    

    回答 4

    这并不难:

    def escapeSpecialCharacters ( text, characters ):
        for character in characters:
            text = text.replace( character, '\\' + character )
        return text
    
    >>> escapeSpecialCharacters( 'I\'m "stuck" :\\', '\'"' )
    'I\\\'m \\"stuck\\" :\\'
    >>> print( _ )
    I\'m \"stuck\" :\

    It’s not that hard:

    def escapeSpecialCharacters ( text, characters ):
        for character in characters:
            text = text.replace( character, '\\' + character )
        return text
    
    >>> escapeSpecialCharacters( 'I\'m "stuck" :\\', '\'"' )
    'I\\\'m \\"stuck\\" :\\'
    >>> print( _ )
    I\'m \"stuck\" :\
    

    回答 5

    如果只想替换某些字符,则可以使用以下命令:

    import re
    
    print re.sub(r'([\.\\\+\*\?\[\^\]\$\(\)\{\}\!\<\>\|\:\-])', r'\\\1', "example string.")

    If you only want to replace some characters you could use this:

    import re
    
    print re.sub(r'([\.\\\+\*\?\[\^\]\$\(\)\{\}\!\<\>\|\:\-])', r'\\\1', "example string.")
    

    替换Python中第一次出现的字符串

    问题:替换Python中第一次出现的字符串

    我有一些示例字符串。如何用空字符串替换长字符串中第一次出现的该字符串?

    regex = re.compile('text')
    match = regex.match(url)
    if match:
        url = url.replace(regex, '')
    

    I have some sample string. How can I replace first occurrence of this string in a longer string with empty string?

    regex = re.compile('text')
    match = regex.match(url)
    if match:
        url = url.replace(regex, '')
    

    回答 0

    字符串replace()函数可以完美解决此问题:

    string.replace(s, old, new[, maxreplace])

    返回字符串 s 的副本,其中所有出现的子字符串 old 都被替换为 new。如果给出了可选参数 maxreplace,则只替换前 maxreplace 次出现。

    >>> u'longlongTESTstringTEST'.replace('TEST', '?', 1)
    u'longlong?stringTEST'
    

    string replace() function perfectly solves this problem:

    string.replace(s, old, new[, maxreplace])

    Return a copy of string s with all occurrences of substring old replaced by new. If the optional argument maxreplace is given, the first maxreplace occurrences are replaced.

    >>> u'longlongTESTstringTEST'.replace('TEST', '?', 1)
    u'longlong?stringTEST'
    

    回答 1

    直接使用 re.sub,它允许你指定 count:

    regex.sub('', url, 1)

    (请注意,参数的顺序是 replacement、original,而不是相反,这可能与预期不符。)

    Use re.sub directly, this allows you to specify a count:

    regex.sub('', url, 1)
    

    (Note that the order of arguments is replacement then original, not the opposite, as one might expect.)
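
    For example (sample strings assumed):

    import re

    url = "textSomeTextWithtext"
    regex = re.compile('text')
    print(regex.sub('', url, 1))             # SomeTextWithtext
    print(re.sub('text', '', url, count=1))  # same, via the module-level function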


    Python正则表达式返回true / false

    问题:Python正则表达式返回true / false

    使用 Python 正则表达式时,如何获得 True/False 的返回值?Python 返回的只是:

    <_sre.SRE_Match object at ...>

    Using Python regular expressions how can you get a True/False returned? All Python returns is:

    <_sre.SRE_Match object at ...>
    

    回答 0

    Match 对象的布尔值始终为 true;如果没有匹配,则返回 None。只需测试其真值即可。

    if re.match(...):

    Match objects are always true, and None is returned if there is no match. Just test for trueness.

    if re.match(...):
    

    回答 1

    如果您确实需要TrueFalse,请使用bool

    >>> bool(re.search("hi", "abcdefghijkl"))
    True
    >>> bool(re.search("hi", "abcdefgijkl"))
    False

    正如其他答案所指出的,如果你只是将其用作 if 或 while 的条件,可以直接使用它而无需用 bool() 包装。

    If you really need True or False, just use bool

    >>> bool(re.search("hi", "abcdefghijkl"))
    True
    >>> bool(re.search("hi", "abcdefgijkl"))
    False
    

    As other answers have pointed out, if you are just using it as a condition for an if or while, you can use it directly without wrapping in bool()


    回答 2

    伊格纳西奥·巴斯克斯(Ignacio Vazquez-Abrams)是正确的。但要详细说明:re.match() 要么返回 None(其布尔值为 False),要么返回一个匹配对象,如他所说,后者的布尔值始终为 True。仅当你需要了解正则表达式匹配到的具体内容时,才需要查看匹配对象的内容。

    Ignacio Vazquez-Abrams is correct. But to elaborate, re.match() will return either None, which evaluates to False, or a match object, which will always be True as he said. Only if you want information about the part(s) that matched your regular expression do you need to check out the contents of the match object.


    回答 3

    一种方法是只针对返回值进行测试。因为你得到了 <_sre.SRE_Match object at ...>,这意味着它的布尔值为 true。当正则表达式不匹配时,将返回 None,其布尔值为 false。

    import re
    
    if re.search("c", "abcdef"):
        print "hi"

    输出为 hi。

    One way to do this is just to test against the return value. Because you’re getting <_sre.SRE_Match object at ...> it means that this will evaluate to true. When the regular expression isn’t matched you’ll get the return value None, which evaluates to false.

    import re
    
    if re.search("c", "abcdef"):
        print "hi"
    

    Produces hi as output.


    回答 4

    这是我的方法:

    import re
    # Compile
    p = re.compile(r'hi')
    # Match and print
    print bool(p.match("abcdefghijkl"))

    Here is my method:

    import re
    # Compile
    p = re.compile(r'hi')
    # Match and print
    print bool(p.match("abcdefghijkl"))
    

    回答 5

    您可以使用 re.match() 或 re.search()。Python 提供了两种基于正则表达式的基本操作:re.match() 仅在字符串的开头检查匹配项,而 re.search() 会在字符串中的任意位置检查匹配项(这是 Perl 的默认行为)。参考这个。

    You can use re.match() or re.search(). Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default). Refer to this.
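
    A quick illustration of the difference (sample strings made up):

    import re

    print(bool(re.match("hi", "hi there")))  # True: "hi" is at the start
    print(bool(re.match("hi", "oh hi")))     # False: match() only looks at the start
    print(bool(re.search("hi", "oh hi")))    # True: search() scans the whole string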


    基于正则表达式的Python拆分字符串

    问题:基于正则表达式的Python拆分字符串

    "HELLO there HOW are YOU"用大写单词分割字符串的最佳方法是什么(在Python中)?

    所以我最终得到一个像这样的数组: results = ['HELLO there', 'HOW are', 'YOU']


    编辑:

    我尝试过:

    p = re.compile("\b[A-Z]{2,}\b")
    print p.split(page_text)

    不过,它似乎不起作用。

    What is the best way to split a string like "HELLO there HOW are YOU" by upper case words (in Python)?

    So I’d end up with an array like such: results = ['HELLO there', 'HOW are', 'YOU']


    EDIT:

    I have tried:

    p = re.compile("\b[A-Z]{2,}\b")
    print p.split(page_text)
    

    It doesn’t seem to work, though.


    回答 0

    我建议

    l = re.compile("(?<!^)\s+(?=[A-Z])(?!.\s)").split(s)

    查看这个演示。

    I suggest

    l = re.compile("(?<!^)\s+(?=[A-Z])(?!.\s)").split(s)
    

    Check this demo.
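
    For example, a quick check against the question’s sample string (with a raw-string prefix added for safety):

    import re

    s = "HELLO there HOW are YOU"
    l = re.compile(r"(?<!^)\s+(?=[A-Z])(?!.\s)").split(s)
    print(l)  # ['HELLO there', 'HOW are', 'YOU']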


    回答 1

    您可以使用前瞻断言(lookahead):

    re.split(r'[ ](?=[A-Z]+\b)', input)

    这将在每一个后面跟着一串以单词边界结尾的大写字母的空格处进行拆分。

    请注意,方括号仅出于可读性考虑,也可以省略。

    如果只要求单词的首字母大写就足够(也就是说,如果你也想在 Hello 前面拆分),那就更简单了:

    re.split(r'[ ](?=[A-Z])', input)

    现在它会在每个后面跟着任意大写字母的空格处进行拆分。

    You could use a lookahead:

    re.split(r'[ ](?=[A-Z]+\b)', input)
    

    This will split at every space that is followed by a string of upper-case letters which end in a word-boundary.

    Note that the square brackets are only for readability and could as well be omitted.

    If it is enough that the first letter of a word is upper case (so if you would want to split in front of Hello as well) it gets even easier:

    re.split(r'[ ](?=[A-Z])', input)
    

    Now this splits at every space followed by any upper-case letter.
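
    A short comparison of the two variants on an extended sample string (made up for illustration):

    import re

    text = "HELLO there HOW are YOU Hello world"
    print(re.split(r'[ ](?=[A-Z]+\b)', text))
    # ['HELLO there', 'HOW are', 'YOU Hello world']
    print(re.split(r'[ ](?=[A-Z])', text))
    # ['HELLO there', 'HOW are', 'YOU', 'Hello world']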


    回答 2

    您的问题包含字符串字面量"\b[A-Z]{2,}\b",但由于没有 r 修饰符,其中的\b表示退格符。

    试试:r"\b[A-Z]{2,}\b"

    Your question contains the string literal "\b[A-Z]{2,}\b", but that \b will mean backspace, because there is no r-modifier.

    Try: r"\b[A-Z]{2,}\b".
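
    With the raw-string fix the pattern compiles to a real word-boundary match, though note that split() also removes the matched upper-case words themselves, which is why the lookahead answers above keep them:

    import re

    page_text = "HELLO there HOW are YOU"
    p = re.compile(r"\b[A-Z]{2,}\b")  # raw string: \b is a word boundary, not backspace
    print(p.split(page_text))         # ['', ' there ', ' are ', '']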


    在括号之间返回文本的正则表达式

    问题:在括号之间返回文本的正则表达式

    u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'

    我需要的只是括号内的内容。

    u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
    

    All I need is the contents inside the parenthesis.


    回答 0

    如果您的问题确实如此简单,则不需要正则表达式:

    s[s.find("(")+1:s.find(")")]

    If your problem is really just this simple, you don’t need regex:

    s[s.find("(")+1:s.find(")")]
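
    Applied to the question’s string:

    s = u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
    print(s[s.find("(")+1:s.find(")")])
    # date='2/xc2/xb2',time='/case/test.png'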
    

    回答 1

    使用re.search(r'\((.*?)\)',s).group(1):

    >>> import re
    >>> s = u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
    >>> re.search(r'\((.*?)\)',s).group(1)
    u"date='2/xc2/xb2',time='/case/test.png'"

    Use re.search(r'\((.*?)\)',s).group(1):

    >>> import re
    >>> s = u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
    >>> re.search(r'\((.*?)\)',s).group(1)
    u"date='2/xc2/xb2',time='/case/test.png'"
    

    回答 2

    如果要查找所有匹配项:

    >>> re.findall('\(.*?\)',s)
    [u"(date='2/xc2/xb2',time='/case/test.png')", u'(eee)']
    
    >>> re.findall('\((.*?)\)',s)
    [u"date='2/xc2/xb2',time='/case/test.png'", u'eee']

    If you want to find all occurences:

    >>> re.findall('\(.*?\)',s)
    [u"(date='2/xc2/xb2',time='/case/test.png')", u'(eee)']
    
    >>> re.findall('\((.*?)\)',s)
    [u"date='2/xc2/xb2',time='/case/test.png'", u'eee']
    

    回答 3

    在 tkerwin 答案的基础上:如果你碰巧遇到像这样的嵌套括号

    st = "sum((a+b)/(c+d))"

    如果你需要取出第一个开括号和最后一个闭括号之间的所有内容以得到 (a+b)/(c+d),他的答案将不起作用,因为 find 从字符串的左侧开始搜索,会停在第一个闭括号处。

    要解决此问题,操作的第二部分需要改用 rfind,因此它将变成

    st[st.find("(")+1:st.rfind(")")]

    Building on tkerwin’s answer, if you happen to have nested parentheses like in

    st = "sum((a+b)/(c+d))"
    

    his answer will not work if you need to take everything between the first opening parenthesis and the last closing parenthesis to get (a+b)/(c+d), because find searches from the left of the string, and would stop at the first closing parenthesis.

    To fix that, you need to use rfind for the second part of the operation, so it would become

    st[st.find("(")+1:st.rfind(")")]
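
    For example:

    st = "sum((a+b)/(c+d))"
    print(st[st.find("(")+1:st.rfind(")")])  # (a+b)/(c+d)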
    

    回答 4

    import re
    
    fancy = u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
    
    print re.compile( "\((.*)\)" ).search( fancy ).group( 1 )
    import re
    
    fancy = u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
    
    print re.compile( "\((.*)\)" ).search( fancy ).group( 1 )
    

    回答 5

    contents_re = re.match(r'[^\(]*\((?P<contents>[^\(]+)\)', data)
    if contents_re:
        print(contents_re.groupdict()['contents'])
    contents_re = re.match(r'[^\(]*\((?P<contents>[^\(]+)\)', data)
    if contents_re:
        print(contents_re.groupdict()['contents'])
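
    A runnable version with the question’s string bound to data, which is otherwise undefined in the snippet above:

    import re

    data = u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
    contents_re = re.match(r'[^\(]*\((?P<contents>[^\(]+)\)', data)
    if contents_re:
        print(contents_re.groupdict()['contents'])
    # date='2/xc2/xb2',time='/case/test.png'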
    

    Python正则表达式-如何获取匹配项的位置和值

    问题:Python正则表达式-如何获取匹配项的位置和值

    如何使用 re 模块获取所有匹配项的开始和结束位置?例如,给定模式 r'[a-z]' 和字符串 'a1b2c3d4',我想获得它找到每个字母的位置。理想情况下,我还想取回匹配的文本。

    How can I get the start and end positions of all matches using the re module? For example given the pattern r'[a-z]' and the string 'a1b2c3d4' I’d want to get the positions where it finds each letter. Ideally, I’d like to get the text of the match back too.


    回答 0

    import re
    p = re.compile("[a-z]")
    for m in p.finditer('a1b2c3d4'):
        print(m.start(), m.group())
    import re
    p = re.compile("[a-z]")
    for m in p.finditer('a1b2c3d4'):
        print(m.start(), m.group())
    

    回答 1

    取自

    正则表达式操作方法

    span()在单个元组中返回开始索引和结束索引。由于match方法仅检查RE是否在字符串开头匹配,因此start()始终为零。但是,RegexObject实例的搜索方法将扫描字符串,因此在这种情况下,匹配可能不会从零开始。

    >>> p = re.compile('[a-z]+')
    >>> print p.match('::: message')
    None
    >>> m = p.search('::: message') ; print m
    <re.MatchObject instance at 80c9650>
    >>> m.group()
    'message'
    >>> m.span()
    (4, 11)

    结合使用:

    在Python 2.2中,finditer()方法也可用,它返回一个MatchObject实例序列作为迭代器。

    >>> p = re.compile( ... )
    >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
    >>> iterator
    <callable-iterator object at 0x401833ac>
    >>> for match in iterator:
    ...     print match.span()
    ...
    (0, 2)
    (22, 24)
    (29, 31)

    你应该能够执行类似下面的操作

    for match in re.finditer(r'[a-z]', 'a1b2c3d4'):
       print match.span()

    Taken from

    Regular Expression HOWTO

    span() returns both start and end indexes in a single tuple. Since the match method only checks if the RE matches at the start of a string, start() will always be zero. However, the search method of RegexObject instances scans through the string, so the match may not start at zero in that case.

    >>> p = re.compile('[a-z]+')
    >>> print p.match('::: message')
    None
    >>> m = p.search('::: message') ; print m
    <re.MatchObject instance at 80c9650>
    >>> m.group()
    'message'
    >>> m.span()
    (4, 11)
    

    Combine that with:

    In Python 2.2, the finditer() method is also available, returning a sequence of MatchObject instances as an iterator.

    >>> p = re.compile( ... )
    >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
    >>> iterator
    <callable-iterator object at 0x401833ac>
    >>> for match in iterator:
    ...     print match.span()
    ...
    (0, 2)
    (22, 24)
    (29, 31)
    

    you should be able to do something on the order of

    for match in re.finditer(r'[a-z]', 'a1b2c3d4'):
       print match.span()
    

    回答 2

    对于Python 3.x

    from re import finditer
    for match in finditer("pattern", "string"):
        print(match.span(), match.group())

    对于字符串中的每个匹配项,你将得到以 \n 分隔的输出:一个元组(分别包含匹配的起始索引和结束索引)以及匹配本身。

    For Python 3.x

    from re import finditer
    for match in finditer("pattern", "string"):
        print(match.span(), match.group())
    

    You will get, for each hit in the string, a \n-separated line containing a tuple (comprising the start and end indices of the match, respectively) and the match itself.
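
    Running the same loop with the pattern and string from the question:

    from re import finditer

    for match in finditer(r'[a-z]', 'a1b2c3d4'):
        print(match.span(), match.group())
    # (0, 1) a
    # (2, 3) b
    # (4, 5) c
    # (6, 7) d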


    回答 3

    请注意,当正则表达式包含多个捕获组时,span 和 group 是按组索引的

    regex_with_3_groups=r"([a-z])([0-9]+)([A-Z])"
    for match in re.finditer(regex_with_3_groups, string):
        for idx in range(0, 4):
            print(match.span(idx), match.group(idx))

    Note that span and group are indexed per capture group when a regex contains multiple capture groups

    regex_with_3_groups=r"([a-z])([0-9]+)([A-Z])"
    for match in re.finditer(regex_with_3_groups, string):
        for idx in range(0, 4):
            print(match.span(idx), match.group(idx))
    

    python的re:如果字符串包含正则表达式模式,则返回True

    问题:python的re:如果字符串包含正则表达式模式,则返回True

    我有一个这样的正则表达式:

    regexp = u'ba[r|z|d]'

    如果单词包含 bar、baz 或 bad,则函数必须返回 True。简而言之,我需要一个与 Python 的如下写法等价的 regexp 用法

    'any-string' in 'text'

    我该如何实现它呢?谢谢!

    I have a regular expression like this:

    regexp = u'ba[r|z|d]'
    

    Function must return True if word contains bar, baz or bad. In short, I need regexp analog for Python’s

    'any-string' in 'text'
    

    How can I realize it? Thanks!


    回答 0

    import re
    word = 'fubar'
    regexp = re.compile(r'ba[rzd]')
    if regexp.search(word):
      print 'matched'
    import re
    word = 'fubar'
    regexp = re.compile(r'ba[rzd]')
    if regexp.search(word):
      print 'matched'
    

    回答 1

    到目前为止最好的是

    bool(re.search('ba[rzd]', 'foobarrrr'))

    返回真

    The best one by far is

    bool(re.search('ba[rzd]', 'foobarrrr'))
    

    Returns True


    回答 2

    Match 对象的布尔值始终为 true;如果没有匹配,则返回 None。只需测试其真值即可。

    代码:

    >>> st = 'bar'
    >>> m = re.match(r"ba[r|z|d]",st)
    >>> if m:
    ...     m.group(0)
    ...
    'bar'

    输出= bar

    如果您想要search功能

    >>> st = "bar"
    >>> m = re.search(r"ba[r|z|d]",st)
    >>> if m is not None:
    ...     m.group(0)
    ...
    'bar'

    如果 regexp 没有找到匹配,则

    >>> st = "hello"
    >>> m = re.search(r"ba[r|z|d]",st)
    >>> if m:
    ...     m.group(0)
    ... else:
    ...   print "no match"
    ...
    no match

    如 @bukzor 所述,如果 st = 'foo bar',match 将不起作用。因此,使用 re.search 更为合适。

    Match objects are always true, and None is returned if there is no match. Just test for trueness.

    Code:

    >>> st = 'bar'
    >>> m = re.match(r"ba[r|z|d]",st)
    >>> if m:
    ...     m.group(0)
    ...
    'bar'
    

    Output = bar

    If you want search functionality

    >>> st = "bar"
    >>> m = re.search(r"ba[r|z|d]",st)
    >>> if m is not None:
    ...     m.group(0)
    ...
    'bar'
    

    and if the regexp is not found, then

    >>> st = "hello"
    >>> m = re.search(r"ba[r|z|d]",st)
    >>> if m:
    ...     m.group(0)
    ... else:
    ...   print "no match"
    ...
    no match
    

    As @bukzor mentioned, if st = 'foo bar' then match will not work. So, it’s more appropriate to use re.search.


    回答 3

    这是一个执行您想要的功能的函数:

    import re
    
    def is_match(regex, text):
        pattern = re.compile(regex)
        return pattern.search(text) is not None

    正则表达式的 search 方法在匹配成功时返回一个对象,如果在字符串中未找到该模式则返回 None。考虑到这一点,只要 search 返回了结果,我们就返回 True。

    例子:

    >>> is_match('ba[rzd]', 'foobar')
    True
    >>> is_match('ba[zrd]', 'foobaz')
    True
    >>> is_match('ba[zrd]', 'foobad')
    True
    >>> is_match('ba[zrd]', 'foobam')
    False

    Here’s a function that does what you want:

    import re
    
    def is_match(regex, text):
        pattern = re.compile(regex)
        return pattern.search(text) is not None
    

    The regular expression search method returns an object on success and None if the pattern is not found in the string. With that in mind, we return True as long as the search gives us something back.

    Examples:

    >>> is_match('ba[rzd]', 'foobar')
    True
    >>> is_match('ba[zrd]', 'foobaz')
    True
    >>> is_match('ba[zrd]', 'foobad')
    True
    >>> is_match('ba[zrd]', 'foobam')
    False
    

    回答 4

    您可以执行以下操作:

    使用 search,如果它与你的搜索字符串匹配,将返回一个 SRE_Match 对象。

    >>> import re
    >>> m = re.search(u'ba[r|z|d]', 'bar')
    >>> m
    <_sre.SRE_Match object at 0x02027288>
    >>> m.group()
    'bar'
    >>> n = re.search(u'ba[r|z|d]', 'bas')
    >>> n.group()

    如果没有匹配,它将返回 None

    Traceback (most recent call last):
      File "<pyshell#17>", line 1, in <module>
        n.group()
    AttributeError: 'NoneType' object has no attribute 'group'

    只是打印它以再次演示:

    >>> print n
    None

    You can do something like this:

    Using search will return an SRE_Match object, if it matches your search string.

    >>> import re
    >>> m = re.search(u'ba[r|z|d]', 'bar')
    >>> m
    <_sre.SRE_Match object at 0x02027288>
    >>> m.group()
    'bar'
    >>> n = re.search(u'ba[r|z|d]', 'bas')
    >>> n.group()
    

    If not, it will return None

    Traceback (most recent call last):
      File "<pyshell#17>", line 1, in <module>
        n.group()
    AttributeError: 'NoneType' object has no attribute 'group'
    

    And just to print it to demonstrate again:

    >>> print n
    None