I have a function taking float arguments (generally integers or decimals with one significant digit), and I need to output the values in a string with two decimal places (5 -> 5.00, 5.5 -> 5.50, etc). How can I do this in Python?
This was new in Python 3.6 – the string is placed in quotation marks as usual, prepended with f'... in the same way you would r'... for a raw string. Then you place whatever you want to put within your string, variables, numbers, inside braces f'some string text with a {variable} or {number} within that text' – and Python evaluates as with previous string formatting methods, except that this method is much more readable.
>>> foobar = 3.141592
>>> print(f'My number is {foobar:.2f} - look at the nice rounding!')
My number is 3.14 - look at the nice rounding!
You can see in this example we format with decimal places in similar fashion to previous string formatting methods.
NB: foobar can be a number, variable, or even an expression, e.g. f'{3*my_func(3.14):02f}'.
Going forward, with new code I prefer f-strings over common %s or str.format() methods as f-strings can be far more readable, and are often much faster.
which is strictly positional, and only comes with the caveat that format() arguments follow Python rules where unnamed args must come first, followed by named arguments, followed by *args (a sequence like a list or tuple) and then **kwargs (a dict keyed with strings if you know what’s good for you).
The interpolation points are determined first by substituting the named values at their labels, and then positionally from what’s left.
So, you can also do this…
Python 3.6 introduces literal string formatting, so that you can format the named parameters without repeating any of your named parameters outside the string:
print(f'<a href="{my_url:s}">{my_url:s}</a>')
This will evaluate my_url, so if it’s not defined you will get a NameError. In fact, instead of my_url, you can write an arbitrary Python expression, as long as it evaluates to a string (because of the :s formatting code). If you want a string representation for the result of an expression that might not be a string, replace :s by !s, just like with regular, pre-literal string formatting.
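For instance, a minimal sketch with an assumed example URL, using the !s conversion so the result is passed through str() first:
my_url = "https://example.com"  # assumed example value
print(f'<a href="{my_url!s}">{my_url!s}</a>')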
For details on literal string formatting, see PEP 498, where it was first introduced.
Answer 3
You will get addicted to the syntax.
C# 6.0 and EcmaScript developers will also be familiar with this syntax.
In [1]: print '{firstname} {lastname}'.format(firstname='Mehmet', lastname='Ağa')
Mehmet Ağa
In [2]: print '{firstname} {lastname}'.format(**dict(firstname='Mehmet', lastname='Ağa'))
Mehmet Ağa
As well as the dictionary way, it may be useful to know the following format:
print '<a href="%s">%s</a>' % (my_url, my_url)
Here it’s a tad redundant, and the dictionary way is certainly less error prone when modifying the code, but it’s still possible to use tuples for multiple insertions. The first %s is substituted for the first element in the tuple, the second %s is substituted for the second element in the tuple, and so on for each element in the tuple.
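For completeness, a hedged sketch of the dictionary way mentioned above (my_url is an assumed example value), which names the interpolation point instead of relying on position:
my_url = "https://example.com"
print '<a href="%(url)s">%(url)s</a>' % {'url': my_url}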
It works for both normal and unicode strings, since they both have a lower method.
In Python 2 it works for a mix of normal and unicode strings, since values of the two types can be compared with each other. Python 3 doesn’t work like that, though: you can’t compare a byte string and a unicode string, so in Python 3 you should do the sane thing and only sort lists of one type of string.
def sortCaseIns(lst):
lst2 = [[x for x in range(0, 2)] for y in range(0, len(lst))]
for i in range(0, len(lst)):
lst2[i][0] = lst[i].lower()
lst2[i][1] = lst[i]
lst2.sort()
for i in range(0, len(lst)):
lst[i] = lst2[i][1]
Then you can just call this function:
sortCaseIns(yourListToSort)
Answer 6
Case-insensitive sort, sorting the string in place, in Python 2 OR 3 (tested in Python 2.7.17 and Python 3.6.9):
>>> x = ["aa", "A", "bb", "B", "cc", "C"]
>>> x.sort()
>>> x
['A', 'B', 'C', 'aa', 'bb', 'cc']
>>> x.sort(key=str.lower) # <===== there it is!
>>> x
['A', 'aa', 'B', 'bb', 'C', 'cc']
The key is key=str.lower. Here are the same commands without the output, for easy copy-pasting so you can test them:
x = ["aa", "A", "bb", "B", "cc", "C"]
x.sort()
x
x.sort(key=str.lower)
x
Note that if your strings are unicode strings, however (like u'some string'), then in Python 2 only (NOT in Python 3 in this case) the above x.sort(key=str.lower) command will fail and output the following error:
TypeError: descriptor 'lower' requires a 'str' object but received a 'unicode'
If you get this error, then either upgrade to Python 3 where they handle unicode sorting, or convert your unicode strings to ASCII strings first, using a list comprehension, like this:
# for Python2, ensure all elements are ASCII (NOT unicode) strings first
x = [str(element) for element in x]
# for Python2, this sort will only work on ASCII (NOT unicode) strings
x.sort(key=str.lower)
I would like to create a string buffer to do lots of processing, format and finally write the buffer in a text file using a C-style sprintf functionality in Python. Because of conditional statements, I can’t write them directly to the file.
Edit, to clarify my question: buf is a big buffer containing all these strings, which have been formatted using sprintf.
Going by your examples, buf will only contain current values, not older ones.
e.g. first I wrote A = something, B = something into buf, and later C = something was appended to the same buf, but in your Python answers buf contains only the last value, which is not what I want – I want buf to contain all the printfs I have done since the beginning, like in C.
Answer 0
Python has a % operator for this.
>>> a = 5
>>> b = "hello"
>>> buf = "A = %d\n , B = %s\n" % (a, b)
>>> print buf
A = 5
, B = hello
>>> c = 10
>>> buf = "C = %d\n" % c
>>> print buf
C = 10
See this reference for all supported format specifiers.
>>> import StringIO
>>> buf = StringIO.StringIO()
>>> buf.write("A = %d, B = %s\n" % (3, "bar"))
>>> buf.write("C=%d\n" % 5)
>>> print(buf.getvalue())
A = 3, B = bar
C=5
To insert into a very long string it is nice to use names for the different arguments, instead of hoping they are in the right positions. This also makes it easier to replace multiple recurrences.
This is probably the closest translation from your C code to Python code.
A = 1
B = "hello"
buf = "A = %d\n , B= %s\n" % (A, B)
c = 2
buf += "C=%d\n" % c
f = open('output.txt', 'w')
print >> f, buf
f.close()
The % operator in Python does almost exactly the same thing as C’s sprintf. You can also print the string to a file directly. If there are lots of these formatted string snippets involved, it might be wise to use a StringIO object to speed up processing time.
So instead of doing +=, do this:
import cStringIO
buf = cStringIO.StringIO()
...
print >> buf, "A = %d\n , B= %s\n" % (A, B)
...
print >> buf, "C=%d\n" % c
...
print >> f, buf.getvalue()
Two approaches are to write to a string buffer or to write lines to a list and join them later. I think the StringIO approach is more pythonic, but didn’t work before Python 2.6.
from io import StringIO
with StringIO() as s:
print("Hello", file=s)
print("Goodbye", file=s)
# And later...
with open('myfile', 'w') as f:
f.write(s.getvalue())
You can also use these without a ContextManager (s = StringIO()). Currently, I’m using a context manager class with a print function. This fragment might be useful to be able to insert debugging or odd paging requirements:
class Report:
... usual init/enter/exit
def print(self, *args, **kwargs):
with StringIO() as s:
print(*args, **kwargs, file=s)
out = s.getvalue()
... stuff with out
with Report() as r:
r.print(f"This is {datetime.date.today()}!", 'Yikes!', end=':')
I would like to read some characters from a string and put them into another string (like we do in C).
So my code looks like this:
import string
import re
str = "Hello World"
j = 0
srr = ""
for i in str:
srr[j] = i #'str' object does not support item assignment
j = j + 1
print (srr)
As aix mentioned – strings in Python are immutable (you cannot change them in place).
What you are trying to do can be done in many ways:
# Copy the string
foo = 'Hello'
bar = foo
# Create a new string by joining all characters of the old string
new_string = ''.join(c for c in oldstring)
# Slice and copy
new_string = oldstring[:]
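If the goal really is the character-by-character loop from the question, a minimal sketch is to collect the characters in a list and join them at the end (list items, unlike string positions, can be assigned to):
src = "Hello World"
chars = []
for ch in src:
    chars.append(ch)
srr = "".join(chars)
print(srr)  # Hello World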
Answer 4
If you want to swap a specific character for another one, you can take yet another approach:
def swap(input_string):
    if len(input_string) == 0:
        return input_string
    if input_string[0] == "x":
        return "y" + swap(input_string[1:])
    else:
        return input_string[0] + swap(input_string[1:])
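For example, a quick check of what it does:
>>> swap("xoxo")
'yoyo'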
How do you check whether a string contains only numbers?
I’ve given it a go here. I’d like to see the simplest way to accomplish this.
import string
def main():
isbn = input("Enter your 10 digit ISBN number: ")
if len(isbn) == 10 and string.digits == True:
print ("Works")
else:
print("Error, 10 digit number was not inputted and/or letters were inputted.")
main()
if __name__ == "__main__":
main()
input("Press enter to exit: ")
Return True if all characters in the string are digits and there is at least one character, False otherwise. Digits include decimal characters and digits that need special handling, such as the compatibility superscript digits. This covers digits which cannot be used to form numbers in base 10, like the Kharosthi numbers. Formally, a digit is a character that has the property value Numeric_Type=Digit or Numeric_Type=Decimal.
If this needs to be avoided, the following simple function checks whether all characters in a string are digits between “0” and “9”:
import string
def contains_only_digits(s):
# True for "", "0", "123"
# False for "1.2", "1,2", "-1", "a", "a1"
for ch in s:
if not ch in string.digits:
return False
return True
Used in the example from the question:
if len(isbn) == 10 and contains_only_digits(isbn):
print ("Works")
Answer 6
You can use a try/except block here:
s = "1234"
try:
    num = int(s)
    print "S contains only digits"
except:
    print "S doesn't contain digits ONLY"
Every time I encounter an issue with this check, it is because the str can sometimes be None. If the str can be None, using only str.isdigit() is not enough, as you will get an error:
AttributeError: ‘NoneType’ object has no attribute ‘isdigit’
so you first need to validate whether the str is None or not. To avoid a multi-branch if, a clear way to do this is:
if str and str.isdigit():
Hope this helps people who have the same issue as me.
# df['result'] = df['result'].str.split(r'\D').str[1]
df['result'] = df['result'].str.split(r'\D').str.get(1)
df

time result
1 09:00 52
2 10:00 62
3 11:00 44
4 12:00 30
5 13:00 110
How do I remove unwanted parts from strings in a column?
6 years after the original question was posted, pandas now has a good number of “vectorised” string functions that can succinctly perform these string manipulation operations.
This answer will explore some of these string functions, suggest faster alternatives, and go into a timings comparison at the end.
With extract, it is necessary to specify at least one capture group. expand=False will return a Series with the captured items from the first capture group.
Not recommended if you are looking for a general solution.
If you are satisfied with the succinct and readable str accessor-based solutions above, you can stop here. However, if you are interested in faster, more performant alternatives, keep reading.
Optimizing: List Comprehensions
In some circumstances, list comprehensions should be favoured over pandas string functions. This is because string functions are inherently hard to vectorise (in the true sense of the word), so most string and regex functions are only wrappers around loops, with more overhead.
The str.replace option can be re-written using re.sub
import re
# Pre-compile your regex pattern for more performance.
p = re.compile(r'\D')
df['result'] = [p.sub('', x) for x in df['result']]
df
time result
1 09:00 52
2 10:00 62
3 11:00 44
4 12:00 30
5 13:00 110
The str.extract example can be re-written using a list comprehension with re.search,
p = re.compile(r'\d+')
df['result'] = [p.search(x)[0] for x in df['result']]
df
time result
1 09:00 52
2 10:00 62
3 11:00 44
4 12:00 30
5 13:00 110
If NaNs or no-matches are a possibility, you will need to re-write the above to include some error checking. I do this using a function.
def try_extract(pattern, string):
try:
m = pattern.search(string)
return m.group(0)
except (TypeError, ValueError, AttributeError):
return np.nan
p = re.compile(r'\d+')
df['result'] = [try_extract(p, x) for x in df['result']]
df
time result
1 09:00 52
2 10:00 62
3 11:00 44
4 12:00 30
5 13:00 110
We can also re-write @eumiro’s and @MonkeyButter’s answers using list comprehensions:
df['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in df['result']]
Some of these comparisons are unfair because they take advantage of the structure of the OP’s data, but take from it what you will. One thing to note is that every list comprehension function is either faster than or comparable to its equivalent pandas variant.
Functions
def eumiro(df):
return df.assign(
result=df['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC')))
def coder375(df):
return df.assign(
result=df['result'].replace(r'\D', r'', regex=True))
def monkeybutter(df):
return df.assign(result=df['result'].map(lambda x: x[1:-1]))
def wes(df):
return df.assign(result=df['result'].str.lstrip('+-').str.rstrip('aAbBcC'))
def cs1(df):
return df.assign(result=df['result'].str.replace(r'\D', ''))
def cs2_ted(df):
# `str.extract` based solution, similar to @Ted Petrou's. so timing together.
return df.assign(result=df['result'].str.extract(r'(\d+)', expand=False))
def cs1_listcomp(df):
return df.assign(result=[p1.sub('', x) for x in df['result']])
def cs2_listcomp(df):
return df.assign(result=[p2.search(x)[0] for x in df['result']])
def cs_eumiro_listcomp(df):
return df.assign(
result=[x.lstrip('+-').rstrip('aAbBcC') for x in df['result']])
def cs_mb_listcomp(df):
return df.assign(result=[x[1:-1] for x in df['result']])
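Note that cs1_listcomp and cs2_listcomp above reference p1 and p2; these are assumed to be the regex patterns pre-compiled earlier in this answer, i.e. something like:
p1 = re.compile(r'\D')
p2 = re.compile(r'\d+')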
I’d use the pandas replace function; it’s very simple and powerful, as you can use regex. Below I’m using the regex \D to remove any non-digit characters, but obviously you could get quite creative with regex.
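A sketch of that replace call, using the column names from the example data:
df['result'] = df['result'].replace(r'\D', r'', regex=True)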
In the particular case where you know the number of positions that you want to remove from the dataframe column, you can use string indexing inside a lambda function to get rid of those parts:
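A sketch, assuming exactly one leading and one trailing character should be dropped:
df['result'] = df['result'].map(lambda x: x[1:-1])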
A very simple method would be to use the extract method to select all the digits. Simply supply it the regular expression '\d+' which extracts any number of digits.
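A sketch of that call (expand=False returns a Series rather than a DataFrame):
df['result'] = df['result'].str.extract(r'(\d+)', expand=False)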
I often use list comprehensions for these types of tasks because they’re often faster.
There can be big differences in performance between the various methods for doing things like this (i.e. modifying every element of a series within a DataFrame). Often a list comprehension can be fastest – see code race below for this task:
import pandas as pd
#Map
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
10000 loops, best of 3: 187 µs per loop
#List comprehension
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in data['result']]
10000 loops, best of 3: 117 µs per loop
#.str
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].str.lstrip('+-').str.rstrip('aAbBcC')
1000 loops, best of 3: 336 µs per loop
Answer 7
Suppose your DF also has those extra characters in between the digits:
result time
0 +52A 09:00
1 +62B 10:00
2 +44a 11:00
3 +30b 12:00
4 -110a 13:00
5 3+b0 14:00
a list of about 750,000 “sentences” (long strings)
a list of about 20,000 “words” that I would like to delete from my 750,000 sentences
So, I have to loop through 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually “words” and are not part of a larger string of characters.
I am doing this by pre-compiling my words so that they are flanked by the \b metacharacter
compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]
Then I loop through my “sentences”
import re
for sentence in sentences:
for word in compiled_words:
sentence = re.sub(word, "", sentence)
# put sentence into a growing list
This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.
Is there a way to use the str.replace method (which I believe is faster), but still require that replacements only happen at word boundaries?
Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping over re.sub if the length of my word is greater than the length of my sentence, but it’s not much of an improvement.
Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP’s, it’s approximately 2000 times faster than the accepted answer.
If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.
Theory
If your sentences aren’t humongous strings, it’s probably feasible to process many more than 50 per second.
If you save all the banned words into a set, it will be very fast to check if another word is included in that set.
Pack the logic into a function, give this function as argument to re.sub and you’re done!
Code
import re
with open('/usr/share/dict/american-english') as wordbook:
banned_words = set(word.strip().lower() for word in wordbook)
def delete_banned_words(matchobj):
word = matchobj.group(0)
if word.lower() in banned_words:
return ""
else:
return word
sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
"GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000
word_pattern = re.compile('\w+')
for sentence in sentences:
sentence = word_pattern.sub(delete_banned_words, sentence)
the search is case-insensitive (thanks to lower())
replacing a word with "" might leave two spaces (as in your code)
With python3, \w+ also matches accented characters (e.g. "ångström").
Any non-word character (tab, space, newline, marks, …) will stay untouched.
Performance
There are a million sentences, banned_words has almost 100000 words and the script runs in less than 7s.
In comparison, Liteye’s answer needed 160s for 10 thousand sentences.
With n being the total number of words and m the number of banned words, the OP’s and Liteye’s code are O(n*m).
In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).
Regex union test
What’s the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?
It’s pretty hard to grasp the way the regex engine works, so let’s write a simple test.
This code extracts 10**i random english words into a list. It creates the corresponding regex union, and tests it with different words :
one is clearly not a word (it begins with #)
one is the first word in the list
one is the last word in the list
one looks like a word but isn’t
import re
import timeit
import random
with open('/usr/share/dict/american-english') as wordbook:
english_words = [word.strip().lower() for word in wordbook]
random.shuffle(english_words)
print("First 10 words :")
print(english_words[:10])
test_words = [
("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
("First word", english_words[0]),
("Last word", english_words[-1]),
("Almost a word", "couldbeaword")
]
def find(word):
def fun():
return union.match(word)
return fun
for exp in range(1, 6):
print("\nUnion of %d words" % 10**exp)
union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
for description, test_word in test_words:
time = timeit.timeit(find(test_word), number=1000) * 1000
print(" %-17s : %.1fms" % (description, time))
It outputs:
First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']
Union of 10 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 0.7ms
Almost a word : 0.7ms
Union of 100 words
Surely not a word : 0.7ms
First word : 1.1ms
Last word : 1.2ms
Almost a word : 1.2ms
Union of 1000 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 9.6ms
Almost a word : 10.1ms
Union of 10000 words
Surely not a word : 1.4ms
First word : 1.8ms
Last word : 96.3ms
Almost a word : 116.6ms
Union of 100000 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 1227.1ms
Almost a word : 1404.1ms
So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b' pattern has:
O(1) best case
O(n/2) average case, which is still O(n)
O(n) worst case
These results are consistent with a simple loop search.
Use this method if you want the fastest regex-based solution. For a dataset similar to the OP’s, it’s approximately 1000 times faster than the accepted answer.
If you don’t care about regex, use this set-based version, which is 2000 times faster than a regex union.
It’s possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex aren’t really human-readable, but they do allow for very fast lookup and match.
The huge advantage is that to test if zoo matches, the regex engine only needs to compare the first character (it doesn’t match), instead of trying the 5 words. It’s a preprocess overkill for 5 words, but it shows promising results for many thousand words.
foo(bar|baz) would save unneeded information to a capturing group, which is why the generated pattern uses non-capturing groups such as foo(?:bar|baz) instead.
Code
Here’s a slightly modified gist, which we can use as a trie.py library:
import re
class Trie():
"""Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
The corresponding Regex should match much faster than a simple Regex union."""
def __init__(self):
self.data = {}
def add(self, word):
ref = self.data
for char in word:
ref[char] = char in ref and ref[char] or {}
ref = ref[char]
ref[''] = 1
def dump(self):
return self.data
def quote(self, char):
return re.escape(char)
def _pattern(self, pData):
data = pData
if "" in data and len(data.keys()) == 1:
return None
alt = []
cc = []
q = 0
for char in sorted(data.keys()):
if isinstance(data[char], dict):
try:
recurse = self._pattern(data[char])
alt.append(self.quote(char) + recurse)
except:
cc.append(self.quote(char))
else:
q = 1
cconly = not len(alt) > 0
if len(cc) > 0:
if len(cc) == 1:
alt.append(cc[0])
else:
alt.append('[' + ''.join(cc) + ']')
if len(alt) == 1:
result = alt[0]
else:
result = "(?:" + "|".join(alt) + ")"
if q:
if cconly:
result += "?"
else:
result = "(?:%s)?" % result
return result
def pattern(self):
return self._pattern(self.dump())
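To get a feel for the output, a small sketch using the class above (worked out by hand, so treat the exact string as approximate):
trie = Trie()
for word in ['zoo', 'zap', 'zip']:
    trie.add(word)
print(trie.pattern())   # z(?:ap|ip|oo)
The shared prefix is factored out, so the engine can reject a non-matching word after a single character.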
# Encoding: utf-8
import re
import timeit
import random
from trie import Trie
with open('/usr/share/dict/american-english') as wordbook:
banned_words = [word.strip().lower() for word in wordbook]
random.shuffle(banned_words)
test_words = [
("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
("First word", banned_words[0]),
("Last word", banned_words[-1]),
("Almost a word", "couldbeaword")
]
def trie_regex_from_words(words):
trie = Trie()
for word in words:
trie.add(word)
return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)
def find(word):
def fun():
return union.match(word)
return fun
for exp in range(1, 6):
print("\nTrieRegex of %d words" % 10**exp)
union = trie_regex_from_words(banned_words[:10**exp])
for description, test_word in test_words:
time = timeit.timeit(find(test_word), number=1000) * 1000
print(" %s : %.1fms" % (description, time))
It outputs:
TrieRegex of 10 words
Surely not a word : 0.3ms
First word : 0.4ms
Last word : 0.5ms
Almost a word : 0.5ms
TrieRegex of 100 words
Surely not a word : 0.3ms
First word : 0.5ms
Last word : 0.9ms
Almost a word : 0.6ms
TrieRegex of 1000 words
Surely not a word : 0.3ms
First word : 0.7ms
Last word : 0.9ms
Almost a word : 1.1ms
TrieRegex of 10000 words
Surely not a word : 0.1ms
First word : 1.0ms
Last word : 1.2ms
Almost a word : 1.2ms
TrieRegex of 100000 words
Surely not a word : 0.3ms
First word : 1.2ms
Last word : 0.9ms
Almost a word : 1.6ms
One thing you might want to try is pre-processing the sentences to encode the word boundaries. Basically turn each sentence into a list of words by splitting on word boundaries.
This should be faster, because to process a sentence, you just have to step through each of the words and check if it’s a match.
Currently the regex search is having to go through the entire string again each time, looking for word boundaries, and then “discarding” the result of this work before the next pass.
#! /bin/env python3
# -*- coding: utf-8

import time, random, re

def replace1( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns:
            sentence = re.sub( "\\b"+search+"\\b", repl, sentence )

def replace2( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns_comp:
            sentence = re.sub( search, repl, sentence )

def replace3( sentences ):
    pd = patterns_dict.get
    for n, sentence in enumerate( sentences ):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)
        # and... done.
        sentence = "".join( pd(w,w) for w in words )
        #~ print( n, sentence )

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w,w)
    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)

# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]

def test( func, num ):
    t = time.time()
    func( test_sentences[:num] )
    print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))

print( "Sentences", len(test_sentences) )
print( "Words    ", len(test_words) )

test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )
Well, here’s a quick and easy solution, with test set.
Winning strategy:
re.sub(r"\w+", repl, sentence) searches for words.
“repl” can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search and replace.
This is the simplest and fastest solution (see function replace4 in example code below).
Second best
The idea is to split the sentences into words, using re.split, while conserving the separators to reconstruct the sentences later. Then, replacements are done with a simple dict lookup.
Perhaps Python is not the right tool here. Here is one approach with the Unix toolchain:
sed G file |
tr ' ' '\n' |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'
assuming your blacklist file is preprocessed with the word boundaries added. The steps are: convert the file to double-spaced, split each sentence into one word per line, mass delete the blacklisted words from the file, and merge back the lines.
This should run at least an order of magnitude faster.
For preprocessing the blacklist file from words (one word per line)
sed 's/.*/\\b&\\b/' words > blacklist
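For example, a hypothetical words file containing foo and bar would come out as:
\bfoo\b
\bbar\b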
Answer 6
How about this:
#!/usr/bin/env python3
from __future__ import unicode_literals, print_function
import re
import time
import io
def replace_sentences_1(sentences, banned_words):
# faster on CPython, but does not use \b as the word separator
# so result is slightly different than replace_sentences_2()
def filter_sentence(sentence):
words = WORD_SPLITTER.split(sentence)
words_iter = iter(words)
for word in words_iter:
norm_word = word.lower()
if norm_word not in banned_words:
yield word
yield next(words_iter) # yield the word separator
WORD_SPLITTER = re.compile(r'(\W+)')
banned_words = set(banned_words)
for sentence in sentences:
yield ''.join(filter_sentence(sentence))
def replace_sentences_2(sentences, banned_words):
# slower on CPython, uses \b as separator
def filter_sentence(sentence):
boundaries = WORD_BOUNDARY.finditer(sentence)
current_boundary = 0
while True:
last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
yield sentence[last_word_boundary:current_boundary] # yield the separators
last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
word = sentence[last_word_boundary:current_boundary]
norm_word = word.lower()
if norm_word not in banned_words:
yield word
WORD_BOUNDARY = re.compile(r'\b')
banned_words = set(banned_words)
for sentence in sentences:
yield ''.join(filter_sentence(sentence))
corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
output.write(sentence.encode('utf-8'))
output.write(b' .')
print('time:', time.time() - start)
These solutions split on word boundaries and look up each word in a set. They should be faster than re.sub with word alternates (Liteye’s solution), as these solutions are O(n), where n is the size of the input, thanks to the amortized O(1) set lookup, while using regex alternates would cause the regex engine to check for word matches at every character rather than just at word boundaries. My solutions take extra care to preserve the whitespace that was used in the original text (i.e. they don’t compress whitespace and they preserve tabs, newlines, and other whitespace characters), but if you decide you don’t care about that, it should be fairly straightforward to strip them from the output.
I tested on corpus.txt, which is a concatenation of multiple eBooks downloaded from the Gutenberg Project, and banned_words.txt is 20000 words randomly picked from Ubuntu’s wordlist (/usr/share/dict/american-english). It takes around 30 seconds to process 862462 sentences (and half of that on PyPy). I’ve defined sentences as anything separated by “. “.
$ # replace_sentences_1()
$ python3 filter_words.py
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py
number of sentences: 862462
time: 15.9370770454
$ # replace_sentences_2()
$ python3 filter_words.py
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py
number of sentences: 862462
time: 13.1190629005
PyPy particularly benefits from the second approach, while CPython fared better with the first approach. The above code should work on both Python 2 and 3.
The solution described below uses a lot of memory to store all the text in one string and to reduce the complexity level. If RAM is an issue, think twice before using it.
With join/split tricks you can avoid loops altogether, which should speed up the algorithm.
Concatenate the sentences with a special delimiter that is not contained in any of the sentences:
merged_sentences = ' * '.join(sentences)
Compile a single regex for all the words you need to remove from the sentences, using the | (“or”) regex alternation:
regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag
Substitute the words out with the compiled regex, then split the text by the special delimiter character back into separate sentences:
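A sketch of that step, using regex and merged_sentences as defined above and ' * ' as the assumed delimiter:
clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')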
"".join complexity is O(n). This is pretty intuitive but anyway there is a shortened quotation from a source:
for (i = 0; i < seqlen; i++) {
[...]
sz += PyUnicode_GET_LENGTH(item);
Therefore with join/split you have O(words) + 2*O(sentences), which is still linear complexity, vs. 2*O(N²) with the initial approach.
By the way, don’t use multithreading. The GIL will block each operation because your task is strictly CPU-bound, so the GIL has no chance to be released; each thread will just send ticks concurrently, which causes extra effort and can even drag the operation out indefinitely.
Concatenate all your sentences into one document. Use any implementation of the Aho-Corasick algorithm (here’s one) to locate all your “bad” words. Traverse the file, replacing each bad word, updating the offsets of found words that follow etc.
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
As of Python 3.7 re.escape() was changed to escape only characters which are meaningful to regex operations.
In the search pattern, include \ as well as the character(s) you’re looking for.
You’re going to be using \ to escape your characters, so you need to escape that as well.
Put parentheses around the search pattern, e.g. ([\"]), so that the substitution pattern can use the found character when it adds \ in front of it. (That’s what \1 does: uses the value of the first parenthesized group.)
The r in front of r'([\"])' means it’s a raw string. Raw strings use different rules for escaping backslashes. To write ([\"]) as a plain string, you’d need to double all the backslashes and write '([\\"])'. Raw strings are friendlier when you’re writing regular expressions.
In the substitution pattern, you need to escape \ to distinguish it from a backslash that precedes a substitution group, e.g. \1, hence r'\\\1'. To write that as a plain string, you’d need '\\\\\\1', and nobody wants that.
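Putting those rules together, a minimal sketch that escapes backslashes and double quotes (adapt the character class to whatever you actually need to escape):
import re
s = 'I\'m "stuck" :\\'
print re.sub(r'([\\"])', r'\\\1', s)   # I'm \"stuck\" :\\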
Use repr()[1:-1]. In this case, the double quotes don’t need to be escaped. The [1:-1] slice is there to remove the single quote from the beginning and the end.
>>> x = raw_input()
I'm "stuck" :\
>>> print x
I'm "stuck" :\
>>> print repr(x)[1:-1]
I\'m "stuck" :\\
Or maybe you just want to escape a phrase to paste into your program? If so, do this:
As it was mentioned above, the answer depends on your case. If you want to escape a string for a regular expression then you should use re.escape(). But if you want to escape a specific set of characters then use this lambda function:
>>> escape = lambda s, escapechar, specialchars: "".join(escapechar + c if c in specialchars or c == escapechar else c for c in s)
>>> s = raw_input()
I'm "stuck" :\
>>> print s
I'm "stuck" :\
>>> print escape(s, "\\", ['"'])
I'm \"stuck\" :\\
Answer 4
It’s not that hard:
def escapeSpecialCharacters ( text, characters ):
    for character in characters:
        text = text.replace( character, '\\' + character )
    return text

>>> escapeSpecialCharacters( 'I\'m "stuck" :\\', '\'"' )
'I\\\'m \\"stuck\\" :\\'
>>> print( _ )
I\'m \"stuck\" :\
So I’ve spent way too much time on this, and it seems to me like it should be a simple fix. I’m trying to use Facebook’s Authentication to register users on my site, and I’m trying to do it server side. I’ve gotten to the point where I get my access token, and when I go to:
It seems like I should just be able to use dict(string) on this but I’m getting this error:
ValueError: dictionary update sequence element #0 has length 1; 2 is required
So I tried using Pickle, but got this error:
KeyError: '{'
I tried using django.serializers to de-serialize it but had similar results. Any thoughts? I feel like the answer has to be simple, and I’m just being stupid. Thanks for any help!
This data is JSON! You can deserialize it using the built-in json module if you’re on Python 2.6+, otherwise you can use the excellent third-party simplejson module.
import json # or `import simplejson as json` if on Python < 2.6
json_string = u'{ "id":"123456789", ... }'
obj = json.loads(json_string) # obj now contains a dict of the data