Use () in the regexp and group(1) in Python to retrieve the captured string (re.search returns None if it doesn’t find a match, so don’t call group() directly):
title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)
if title_search:
    title = title_search.group(1)
Note that starting with Python 3.8 and the introduction of assignment expressions (PEP 572, the := operator), it’s possible to improve a bit on Krzysztof Krasoń’s solution by capturing the match result directly within the if condition as a variable and reusing it in the condition’s body:
# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
    title = match.group(1)
# title == 'hello'
Answer 2
Try with capturing groups:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)
… assuming that your text (HTML) is in a variable named “text.”
This also assumes that there are no other HTML tags legally embedded inside an HTML TITLE tag, and no way to legally embed any other < character within such a container/block.
However …
Don’t use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you’re going to write a full parser, which would be a lot of extra work when various HTML, SGML and XML parsers are already in the standard libraries.)
If you’re handling “real world” tag-soup HTML (which frequently fails to conform to any SGML/XML validator), use the BeautifulSoup package. It isn’t in the standard libraries (yet) but is widely recommended for this purpose.
Another option is lxml, which is written for properly structured (standards-conformant) HTML. But it has an option to fall back to using BeautifulSoup as a parser: ElementSoup.
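For illustration, here is a minimal sketch of the parser-based approach using only the standard library’s html.parser (BeautifulSoup or lxml would look similar but handle tag soup more gracefully); TitleParser is a name invented for this example:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # only capture the first <title> encountered
        if tag == 'title' and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self.title = data

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

parser = TitleParser()
parser.feed('<html><head><title>hello</title></head></html>')
print(parser.title)  # hello
```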
You need to capture from the regex: search for the pattern and, if it’s found, retrieve the string using group(index). Assuming valid checks are performed:
>>> p = re.compile("name (.*) is valid")
>>> result = p.search(s)
>>> result
<_sre.SRE_Match object at 0x10555e738>
>>> result.group(1) # group(1) will return the 1st capture (stuff within the brackets).
# group(0) will return the entire matched text.
'my_user_name'
Answer 1
You can use matching groups:
p = re.compile('name (.*) is valid')
For example:
>>> import re
>>> p = re.compile('name (.*) is valid')
>>> s = """
... someline abc
... someother line
... name my_user_name is valid
... some more lines"""
>>> p.findall(s)
['my_user_name']
Here I use re.findall rather than re.search to get all instances of my_user_name. Using re.search, you’d need to get the data from the group on the match object:
>>> p.search(s) #gives a match object or None if no match is found
<_sre.SRE_Match object at 0xf5c60>
>>> p.search(s).group() #entire string that matched
'name my_user_name is valid'
>>> p.search(s).group(1) #first group that matched in the string that matched
'my_user_name'
As mentioned in the comments, you might want to make your regex non-greedy:
p = re.compile('name (.*?) is valid')
to only pick up the stuff between 'name ' and the next ' is valid' (rather than allowing your regex to pick up other ' is valid' strings in your group).
Answer 2
You could use something like this:
import re

s = #that big string

# the parentheses create a group with what was matched
# and '\w' matches only alphanumeric characters
p = re.compile(r"name +(\w+) +is valid")  # add flags here if needed, e.g. re.IGNORECASE

# use search(), so the match doesn't have to happen
# at the beginning of "big string"
m = p.search(s)

# search() returns a Match object with information about what was matched
if m:
    name = m.group(1)
else:
    raise Exception('name not found')
Answer 3
Maybe that’s a bit shorter and easier to understand:
import re
text = '... someline abc... someother line... name my_user_name is valid.. some more lines'
>>> re.search('name (.*) is valid', text).group(1)
'my_user_name'
You can use groups (indicated with '(' and ')') to capture parts of the string. The match object’s group() method then gives you the group’s contents:
>>> import re
>>> s = 'name my_user_name is valid'
>>> match = re.search('name (.*) is valid', s)
>>> match.group(0) # the entire match
'name my_user_name is valid'
>>> match.group(1) # the first parenthesized subgroup
'my_user_name'
In Python 3.6+ you can also index into a match object instead of using group():
>>> match[0] # the entire match
'name my_user_name is valid'
>>> match[1] # the first parenthesized subgroup
'my_user_name'
string = '''someline abc\n
someother line\n
name my_user_name is valid\n
some more lines\n'''

pattern = r'name (?P<user>.*) is valid'
matches = re.search(pattern, string, re.DOTALL)
print(matches['user'])  # my_user_name
It seems like you’re actually trying to extract a name rather than simply find a match. If this is the case, having span indexes for your match is helpful and I’d recommend using re.finditer. As a shortcut, you know the name part of your regex is length 5 and the ' is valid' part is length 9, so you can slice the matching text to extract the name.
Note – In your example, it looks like s is string with line breaks, so that’s what’s assumed below.
## convert s to a list of strings separated by line:
s2 = s.splitlines()

## find matches by line:
for i, j in enumerate(s2):
    matches = re.finditer("name (.*) is valid", j)
    ## loop through matches (finditer yields nothing on lines without a match)
    for k in matches:
        ## get text
        match_txt = k.group(0)
        ## get line span
        match_span = k.span(0)
        ## extract username
        my_user_name = match_txt[5:-9]
        ## compare with original text
        print(f'Extracted Username: {my_user_name} - found on line {i}')
        print('Match Text:', match_txt)
a list of about 750,000 “sentences” (long strings)
a list of about 20,000 “words” that I would like to delete from my 750,000 sentences
So, I have to loop through 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually “words” and are not part of a larger string of characters.
I am doing this by pre-compiling my words so that they are flanked by the \b metacharacter
compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]
Then I loop through my “sentences”
import re
for sentence in sentences:
    for word in compiled_words:
        sentence = re.sub(word, "", sentence)
    # put sentence into a growing list
This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.
Is there a way to use the str.replace method (which I believe is faster), while still requiring that replacements only happen at word boundaries?
Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping re.sub when the length of my word is greater than the length of my sentence, but it’s not much of an improvement.
Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP’s, it’s approximately 2000 times faster than the accepted answer.
If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.
Theory
If your sentences aren’t humongous strings, it’s probably feasible to process many more than 50 per second.
If you save all the banned words into a set, it will be very fast to check if another word is included in that set.
Pack the logic into a function, give this function as argument to re.sub and you’re done!
Code
import re
with open('/usr/share/dict/american-english') as wordbook:
    banned_words = set(word.strip().lower() for word in wordbook)

def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
             "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

word_pattern = re.compile(r'\w+')

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)
the search is case-insensitive (thanks to lower())
replacing a word with "" might leave two spaces (as in your code)
With python3, \w+ also matches accented characters (e.g. "ångström").
Any non-word character (tab, space, newline, marks, …) will stay untouched.
Performance
There are a million sentences, banned_words has almost 100000 words and the script runs in less than 7s.
In comparison, Liteye’s answer needed 160s for 10 thousand sentences.
With n being the total amount of words and m the amount of banned words, OP’s and Liteye’s code are O(n*m).
In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).
Regex union test
What’s the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?
It’s pretty hard to grasp the way the regex engine works, so let’s write a simple test.
This code extracts 10**i random English words into a list. It creates the corresponding regex union, and tests it with different words:
one is clearly not a word (it begins with #)
one is the first word in the list
one is the last word in the list
one looks like a word but isn’t
import re
import timeit
import random
with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook]
random.shuffle(english_words)

print("First 10 words :")
print(english_words[:10])

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", english_words[0]),
    ("Last word", english_words[-1]),
    ("Almost a word", "couldbeaword")
]

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nUnion of %d words" % 10**exp)
    union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %-17s : %.1fms" % (description, time))
It outputs:
First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']
Union of 10 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 0.7ms
Almost a word : 0.7ms
Union of 100 words
Surely not a word : 0.7ms
First word : 1.1ms
Last word : 1.2ms
Almost a word : 1.2ms
Union of 1000 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 9.6ms
Almost a word : 10.1ms
Union of 10000 words
Surely not a word : 1.4ms
First word : 1.8ms
Last word : 96.3ms
Almost a word : 116.6ms
Union of 100000 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 1227.1ms
Almost a word : 1404.1ms
So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b' pattern has:
O(1) best case
O(n/2) average case, which is still O(n)
O(n) worst case
These results are consistent with a simple loop search.
Use this method if you want the fastest regex-based solution. For a dataset similar to the OP’s, it’s approximately 1000 times faster than the accepted answer.
If you don’t care about regex, use this set-based version, which is 2000 times faster than a regex union.
It’s possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex aren’t really human-readable, but they do allow for very fast lookup and match.
The huge advantage is that, to test whether e.g. zoo matches, the regex engine only needs to compare the first character (it doesn’t match) instead of trying every word in the alternation. The preprocessing is overkill for a handful of words, but it shows promising results for many thousands of words.
Note that foo(bar|baz) would save unneeded information to a capturing group, which is why the trie code emits non-capturing groups like foo(?:bar|baz) instead.
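As a quick illustration of that remark (not from the original answer): a non-capturing group (?:...) matches exactly like a capturing group but records nothing:

```python
import re

# Capturing group keeps the matched alternative; non-capturing group doesn't.
m1 = re.match(r'foo(bar|baz)', 'foobar')
m2 = re.match(r'foo(?:bar|baz)', 'foobar')
print(m1.groups())  # ('bar',)
print(m2.groups())  # ()
```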
Code
Here’s a slightly modified gist, which we can use as a trie.py library:
import re
class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())
# Encoding: utf-8
import re
import timeit
import random
from trie import Trie
with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))
It outputs:
TrieRegex of 10 words
Surely not a word : 0.3ms
First word : 0.4ms
Last word : 0.5ms
Almost a word : 0.5ms
TrieRegex of 100 words
Surely not a word : 0.3ms
First word : 0.5ms
Last word : 0.9ms
Almost a word : 0.6ms
TrieRegex of 1000 words
Surely not a word : 0.3ms
First word : 0.7ms
Last word : 0.9ms
Almost a word : 1.1ms
TrieRegex of 10000 words
Surely not a word : 0.1ms
First word : 1.0ms
Last word : 1.2ms
Almost a word : 1.2ms
TrieRegex of 100000 words
Surely not a word : 0.3ms
First word : 1.2ms
Last word : 0.9ms
Almost a word : 1.6ms
One thing you might want to try is pre-processing the sentences to encode the word boundaries. Basically turn each sentence into a list of words by splitting on word boundaries.
This should be faster, because to process a sentence, you just have to step through each of the words and check if it’s a match.
Currently the regex search is having to go through the entire string again each time, looking for word boundaries, and then “discarding” the result of this work before the next pass.
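A minimal sketch of this pre-splitting idea (the banned set and the strip_banned helper are illustrative names, not from the original answers):

```python
import re

banned = {"apple", "cherry"}  # hypothetical banned-word set

def strip_banned(sentence):
    # Split on word boundaries, keeping the separators (the capturing
    # group in the split pattern) so the sentence can be reassembled
    # without mangling spacing or punctuation.
    parts = re.split(r'(\W+)', sentence)
    return ''.join('' if p.lower() in banned else p for p in parts)

print(strip_banned("apple pie, cherry tart"))  # ' pie,  tart'
```

Because the comparison is a set lookup on whole words, "apple" inside "pineapple" is left untouched.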
#! /bin/env python3
# -*- coding: utf-8
import time, random, re

def replace1( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns:
            sentence = re.sub( "\\b"+search+"\\b", repl, sentence )

def replace2( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns_comp:
            sentence = re.sub( search, repl, sentence )

def replace3( sentences ):
    pd = patterns_dict.get
    for n, sentence in enumerate( sentences ):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)
        # and... done.
        sentence = "".join( pd(w,w) for w in words )
        #~ print( n, sentence )

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w,w)
    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)

# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]

def test( func, num ):
    t = time.time()
    func( test_sentences[:num] )
    print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)) )

print( "Sentences", len(test_sentences) )
print( "Words    ", len(test_words) )

test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )
Above is a quick and easy solution, with a test set.
Winning strategy:
re.sub(“\w+”,repl,sentence) searches for words.
“repl” can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search and replace.
This is the simplest and fastest solution (see function replace4 in the example code above).
Second best
The idea is to split the sentences into words, using re.split, while conserving the separators to reconstruct the sentences later. Then, replacements are done with a simple dict lookup.
Perhaps Python is not the right tool here. Here is one approach with the Unix toolchain:
sed G file |
tr ' ' '\n' |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'
assuming your blacklist file is preprocessed with the word boundaries added. The steps are: convert the file to double spaced, split each sentence to one word per line, mass delete the blacklist words from the file, and merge back the lines.
This should run at least an order of magnitude faster.
For preprocessing the blacklist file from words (one word per line)
sed 's/.*/\\b&\\b/' words > blacklist
Answer 6
How about this:
#!/usr/bin/env python3
from __future__ import unicode_literals, print_function
import re
import time
import io
def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            yield next(words_iter)  # yield the word separator

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))

def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        while True:
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            yield sentence[last_word_boundary:current_boundary]  # yield the separators
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            word = sentence[last_word_boundary:current_boundary]
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))

corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')

print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b' .')
print('time:', time.time() - start)
These solutions split on word boundaries and look up each word in a set. They should be faster than re.sub with word alternates (Liteye’s solution), as these solutions are O(n), where n is the size of the input, thanks to the amortized O(1) set lookup, while using regex alternates would cause the regex engine to check for word matches at every character rather than just at word boundaries. My solutions take extra care to preserve the whitespace that was used in the original text (i.e. they don’t compress whitespace and preserve tabs, newlines, and other whitespace characters), but if you decide that you don’t care about that, it should be fairly straightforward to remove it from the output.
I tested on corpus.txt, which is a concatenation of multiple eBooks downloaded from the Gutenberg Project, and banned_words.txt is 20000 words randomly picked from Ubuntu’s wordlist (/usr/share/dict/american-english). It takes around 30 seconds to process 862462 sentences (and half of that on PyPy). I’ve defined sentences as anything separated by “. “.
$ # replace_sentences_1()
$ python3 filter_words.py
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py
number of sentences: 862462
time: 15.9370770454
$ # replace_sentences_2()
$ python3 filter_words.py
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py
number of sentences: 862462
time: 13.1190629005
PyPy particularly benefits from the second approach, while CPython fared better with the first. The above code should work on both Python 2 and 3.
The solution described below uses a lot of memory to store all the text as a single string, reducing the complexity level. If RAM is an issue, think twice before using it.
With join/split tricks you can avoid loops at all which should speed up the algorithm.
Concatenate the sentences with a special delimiter that is not contained in any of the sentences:
merged_sentences = ' * '.join(sentences)
Compile a single regex for all the words you need to remove from the sentences, using the | “or” regex operator:
regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag
Substitute the words using the compiled regex, then split the result on the special delimiter back into separate sentences:
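The snippet for this third step is elided in the original; assuming the merged_sentences and regex built in the previous steps, it would plausibly look like this (with toy data for illustration):

```python
import re

sentences = ["Hello John", "how are you", "John loves Mary"]
words = ["John", "Mary"]  # hypothetical words to remove

merged_sentences = ' * '.join(sentences)
regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I)

# One pass of re.sub over the merged text, then split back into sentences:
clean_sentences = regex.sub("", merged_sentences).split(' * ')
print(clean_sentences)  # ['Hello ', 'how are you', ' loves ']
```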
"".join complexity is O(n). This is pretty intuitive but anyway there is a shortened quotation from a source:
for (i = 0; i < seqlen; i++) {
    [...]
    sz += PyUnicode_GET_LENGTH(item);
}
Therefore with join/split you have O(words) + 2*O(sentences), which is still linear complexity vs the 2*O(N²) of the initial approach.
BTW, don’t use multithreading. The GIL will block each operation because your task is strictly CPU-bound, so the GIL has no chance to be released, and each thread would just add concurrent scheduling overhead, dragging the operation out even further.
Concatenate all your sentences into one document. Use any implementation of the Aho-Corasick algorithm (here’s one) to locate all your “bad” words. Traverse the file, replacing each bad word, updating the offsets of found words that follow etc.
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
As of Python 3.7 re.escape() was changed to escape only characters which are meaningful to regex operations.
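A small sketch of why this matters: escaping a literal before embedding it in a larger pattern keeps metacharacters such as . from matching arbitrary characters:

```python
import re

# re.escape makes a literal string safe to embed in a pattern:
pattern = re.escape("www.python.org") + r"$"
print(pattern)  # www\.python\.org$
print(bool(re.search(pattern, "visit www.python.org")))  # True
print(bool(re.search(pattern, "visit wwwXpythonXorg")))  # False
```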
In the search pattern, include \ as well as the character(s) you’re looking for.
You’re going to be using \ to escape your characters, so you need to escape
that as well.
Put parentheses around the search pattern, e.g. ([\"]), so that the substitution
pattern can use the found character when it adds \ in front of it. (That’s what
\1 does: uses the value of the first parenthesized group.)
The r in front of r'([\"])' means it’s a raw string. Raw strings use different
rules for escaping backslashes. To write ([\"]) as a plain string, you’d need to
double all the backslashes and write '([\\"])'. Raw strings are friendlier when
you’re writing regular expressions.
In the substitution pattern, you need to escape \ to distinguish it from a
backslash that precedes a substitution group, e.g. \1, hence r'\\\1'. To write
that as a plain string, you’d need '\\\\\\1' — and nobody wants that.
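Putting those pieces together, a minimal sketch of the substitution described above:

```python
import re

s = 'say "hello"'
# r'([\"])' captures each double quote; r'\\\1' emits a literal
# backslash followed by the captured character.
print(re.sub(r'([\"])', r'\\\1', s))  # say \"hello\"
```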
Use repr()[1:-1]. In this case, the double quotes don’t need to be escaped. The [1:-1] slice is to remove the single quote from the beginning and the end.
>>> x = raw_input()
I'm "stuck" :\
>>> print x
I'm "stuck" :\
>>> print repr(x)[1:-1]
I\'m "stuck" :\\
Or maybe you just want to escape a phrase to paste into your program? If so, do this:
As it was mentioned above, the answer depends on your case. If you want to escape a string for a regular expression then you should use re.escape(). But if you want to escape a specific set of characters then use this lambda function:
>>> escape = lambda s, escapechar, specialchars: "".join(escapechar + c if c in specialchars or c == escapechar else c for c in s)
>>> s = raw_input()
I'm "stuck" :\
>>> print s
I'm "stuck" :\
>>> print escape(s, "\\", ['"'])
I'm \"stuck\" :\\
Answer 4
It’s not hard:
def escapeSpecialCharacters(text, characters):
    for character in characters:
        text = text.replace(character, '\\' + character)
    return text

>>> escapeSpecialCharacters('I\'m "stuck" :\\', '\'"')
'I\\\'m \\"stuck\\" :\\'
>>> print(_)
I\'m \"stuck\" :\
The string replace() function solves this problem nicely:
string.replace(s, old, new[, maxreplace])
Return a copy of string s with all occurrences of substring old replaced by new. If the optional argument maxreplace is given, the first maxreplace occurrences are replaced.
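That signature is the old Python 2 string-module function; in Python 3 the equivalent str.replace() method behaves the same way (a quick illustration with a made-up string):

```python
s = "one two two three"

# Replace all occurrences of the substring.
print(s.replace("two", "2"))     # one 2 2 three

# The optional count argument limits how many are replaced.
print(s.replace("two", "2", 1))  # one 2 two three
```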
Ignacio Vazquez-Abrams is correct. But to elaborate, re.match() will return either None, which evaluates to False, or a match object, which will always be True as he said. Only if you want information about the part(s) that matched your regular expression do you need to check out the contents of the match object.
Answer 3
One way to do this is just to test against the return value. Because you’re getting <_sre.SRE_Match object at ...> it means that this will evaluate to true. When the regular expression isn’t matched, you’ll get the return value None, which evaluates to false.
import re
if re.search("c", "abcdef"):
print "hi"
Produces hi as output.
Answer 4
Here is my approach:
import re
# Compile
p = re.compile(r'hi')

# Match and print
print bool(p.match("abcdefghijkl"))
You can use re.match() or re.search().
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default); see the re module documentation for details.
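A quick illustration of the difference (the strings are made-up examples):

```python
import re

# match() anchors at the start of the string, so "c" is not found there.
print(re.match("c", "abcdef"))               # None

# search() scans the whole string and finds it.
print(re.search("c", "abcdef") is not None)  # True
```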
If your problem is really just this simple, you don’t need regex:
s[s.find("(")+1:s.find(")")]
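For instance, with a made-up string:

```python
# Take everything between the first "(" and the first ")".
s = "name(alpha)rest"
print(s[s.find("(") + 1 : s.find(")")])  # alpha
```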
Answer 1
Use re.search(r'\((.*?)\)', s).group(1):
>>> import re
>>> s = u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'
>>> re.search(r'\((.*?)\)',s).group(1)
u"date='2/xc2/xb2',time='/case/test.png'"
Building on tkerwin’s answer, if you happen to have nested parentheses like in
st = "sum((a+b)/(c+d))"
his answer will not work if you need to take everything between the first opening parenthesis and the last closing parenthesis to get (a+b)/(c+d), because find searches from the left of the string, and would stop at the first closing parenthesis.
To fix that, you need to use rfind for the second part of the operation, so it would become
st[st.find("(")+1:st.rfind(")")]
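Checking that against the example string from above:

```python
st = "sum((a+b)/(c+d))"

# find() locates the first "(", rfind() the last ")", so everything
# between them is kept, including the nested parentheses.
print(st[st.find("(") + 1 : st.rfind(")")])  # (a+b)/(c+d)
```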
Answer 4
import re
fancy = u'abcde(date=\'2/xc2/xb2\',time=\'/case/test.png\')'

print re.compile(r"\((.*)\)").search(fancy).group(1)
How can I get the start and end positions of all matches using the re module? For example given the pattern r'[a-z]' and the string 'a1b2c3d4' I’d want to get the positions where it finds each letter. Ideally, I’d like to get the text of the match back too.
Answer 0
import re
p = re.compile("[a-z]")
for m in p.finditer('a1b2c3d4'):
    print(m.start(), m.group())
span() returns both start and end indexes in a single tuple. Since the
match method only checks if the RE matches at the start of a string,
start() will always be zero. However, the search method of RegexObject
instances scans through the string, so the match may not start at zero
in that case.
>>> p = re.compile('[a-z]+')
>>> print p.match('::: message')
None
>>> m = p.search('::: message') ; print m
<re.MatchObject instance at 80c9650>
>>> m.group()
'message'
>>> m.span()
(4, 11)
Combine that with:
In Python 2.2, the finditer() method is also available, returning a sequence of MatchObject instances as an iterator.
>>> p = re.compile( ... )
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable-iterator object at 0x401833ac>
>>> for match in iterator:
... print match.span()
...
(0, 2)
(22, 24)
(29, 31)
you should be able to do something on the order of
for match in re.finditer(r'[a-z]', 'a1b2c3d4'):
print match.span()
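Applied to the question’s own example, each match object carries start(), end(), span() and the matched text:

```python
import re

# For r'[a-z]' on 'a1b2c3d4', the letters sit at even indices.
for m in re.finditer(r'[a-z]', 'a1b2c3d4'):
    print(m.start(), m.end(), m.group())
# 0 1 a
# 2 3 b
# 4 5 c
# 6 7 d
```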
Answer 2
For Python 3.x:
from re import finditer
for match in finditer("pattern", "string"):
print(match.span(), match.group())
This prints, for each hit in the string, a tuple of the first and last indices of the match followed by the match itself, one per line.
Answer 3
Note that span() and group() take an index when the regex has multiple capture groups:
regex_with_3_groups=r"([a-z])([0-9]+)([A-Z])"
for match in re.finditer(regex_with_3_groups, string):
for idx in range(0, 4):
print(match.span(idx), match.group(idx))
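A runnable version of the snippet above, with a made-up input string (the original leaves `string` undefined):

```python
import re

regex_with_3_groups = r"([a-z])([0-9]+)([A-Z])"
# Index 0 is the whole match; 1..3 are the capture groups.
for match in re.finditer(regex_with_3_groups, "a12Bc34D"):
    for idx in range(0, 4):
        print(match.span(idx), match.group(idx))
```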
import re
def is_match(regex, text):
    pattern = re.compile(regex)
    return pattern.search(text) is not None
The regular expression search method returns an object on success and None if the pattern is not found in the string. With that in mind, we return True as long as the search gives us something back.
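A usage sketch for that helper (restated here so the snippet is self-contained; the patterns and strings are made-up examples):

```python
import re

def is_match(regex, text):
    # search() returns a match object on success, None otherwise.
    return re.search(regex, text) is not None

print(is_match("ba[rzd]", "foobar"))  # True
print(is_match("ba[rzd]", "fooban"))  # False
```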
Using search will return an SRE_Match object if it matches your search string.
>>> import re
>>> m = re.search(u'ba[r|z|d]', 'bar')
>>> m
<_sre.SRE_Match object at 0x02027288>
>>> m.group()
'bar'
>>> n = re.search(u'ba[r|z|d]', 'bas')
>>> n.group()
If not, it will return None
Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
n.group()
AttributeError: 'NoneType' object has no attribute 'group'
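To avoid that AttributeError, test the result before calling group(), as the top of this page recommends (a small sketch; note that inside a character class | is a literal, so [rzd] is the cleaner spelling of [r|z|d]):

```python
import re

n = re.search('ba[rzd]', 'bas')
if n:
    print(n.group())
else:
    print("no match")  # no match
```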