Tag Archives: replace

Why doesn't calling a Python string method do anything unless you assign its output?

Question: Why doesn't calling a Python string method do anything unless you assign its output?

I try to do a simple string replacement, but I don’t know why it doesn’t seem to work:

X = "hello world"
X.replace("hello", "goodbye")

I want to change the word hello to goodbye, thus it should change the string "hello world" to "goodbye world". But X just remains "hello world". Why is my code not working?


Answer 0

This is because strings are immutable in Python.

This means that X.replace("hello", "goodbye") returns a copy of X with the replacement made. Because of that, you need to replace this line:

X.replace("hello", "goodbye")

with this line:

X = X.replace("hello", "goodbye")

More broadly, this is true for all Python string methods that appear to modify a string's content "in place": they actually return a new string, e.g. replace, strip, translate, lower/upper, join, …

You must assign their output to something if you want to use it and not throw it away, e.g.

X  = X.strip(' \t')
X2 = X.translate(...)
Y  = X.lower()
Z  = X.upper()
A  = X.join(':')
B  = X.capitalize()
C  = X.casefold()

and so on.
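As a quick illustration (a minimal interactive sketch, with arbitrary variable names), replace leaves the original object untouched and hands you a new one:

>>> x = "hello world"
>>> y = x.replace("hello", "goodbye")
>>> x
'hello world'
>>> y
'goodbye world'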


Answer 1

All string functions such as lower, upper, and strip return a new string without modifying the original. If you try to modify a string in place, as you might think is possible since it is an iterable, it will fail.

x = 'hello'
x[0] = 'i' #'str' object does not support item assignment
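If you do need a modified version, build a new string and rebind the name (a small sketch):

x = 'hello'
x = 'i' + x[1:]  # creates the new string 'iello'; the original is discarded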

There is a good read about the importance of strings being immutable: Why are Python strings immutable? Best practices for using them


Speed up millions of regex replacements in Python 3

Question: Speed up millions of regex replacements in Python 3

I’m using Python 3.5.2

I have two lists

  • a list of about 750,000 “sentences” (long strings)
  • a list of about 20,000 “words” that I would like to delete from my 750,000 sentences

So, I have to loop through 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually “words” and are not part of a larger string of characters.

I am doing this by pre-compiling my words so that they are flanked by the \b metacharacter

compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

Then I loop through my “sentences”

import re

for sentence in sentences:
  for word in compiled_words:
    sentence = re.sub(word, "", sentence)
  # put sentence into a growing list

This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

  • Is there a way to use the str.replace method (which I believe is faster), but still require that replacements only happen at word boundaries?

  • Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping re.sub if the length of my word is greater than the length of my sentence, but it’s not much of an improvement.

Thank you for any suggestions.


Answer 0

One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".

Because re relies on C code to do the actual matching, the savings can be dramatic.

As @pvg pointed out in the comments, it also benefits from single pass matching.

If your words are not regexes (i.e. they are plain strings), Eric’s answer is faster.
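A minimal sketch of the single-pattern idea (illustrative names; re.escape is used since the words are assumed to be plain strings):

import re

words = ["hello", "world", "foo"]  # stand-ins for the 20,000 words
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b')

sentences = ["hello there world", "foothold intact"]
cleaned = [pattern.sub("", s) for s in sentences]
# [' there ', 'foothold intact'] -- "foothold" survives thanks to \b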


Answer 1

TLDR

Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP’s, it’s approximately 2000 times faster than the accepted answer.

If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.

Theory

If your sentences aren’t humongous strings, it’s probably feasible to process many more than 50 per second.

If you save all the banned words into a set, it will be very fast to check if another word is included in that set.

Pack the logic into a function, pass this function as the argument to re.sub, and you’re done!

Code

import re
with open('/usr/share/dict/american-english') as wordbook:
    banned_words = set(word.strip().lower() for word in wordbook)


def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
             "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

word_pattern = re.compile(r'\w+')

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)

Converted sentences are:

' .  !
  .
GiraffeElephantBoat
sfgsdg sdwerha aswertwe

Note that:

  • the search is case-insensitive (thanks to lower())
  • replacing a word with "" might leave two spaces (as in your code)
  • With python3, \w+ also matches accented characters (e.g. "ångström").
  • Any non-word character (tab, space, newline, marks, …) will stay untouched.

Performance

There are a million sentences, banned_words has almost 100000 words and the script runs in less than 7s.

In comparison, Liteye’s answer needed 160s for 10 thousand sentences.

With n being the total amount of words and m the amount of banned words, OP’s and Liteye’s code are O(n*m).

In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).

Regex union test

What’s the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?

It’s pretty hard to grasp the way the regex engine works, so let’s write a simple test.

This code extracts 10**i random English words into a list. It creates the corresponding regex union, and tests it with different words:

  • one is clearly not a word (it begins with #)
  • one is the first word in the list
  • one is the last word in the list
  • one looks like a word but isn’t


import re
import timeit
import random

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook]
    random.shuffle(english_words)

print("First 10 words :")
print(english_words[:10])

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", english_words[0]),
    ("Last word", english_words[-1]),
    ("Almost a word", "couldbeaword")
]


def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nUnion of %d words" % 10**exp)
    union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %-17s : %.1fms" % (description, time))

It outputs:

First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']

Union of 10 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 0.7ms
  Almost a word     : 0.7ms

Union of 100 words
  Surely not a word : 0.7ms
  First word        : 1.1ms
  Last word         : 1.2ms
  Almost a word     : 1.2ms

Union of 1000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 9.6ms
  Almost a word     : 10.1ms

Union of 10000 words
  Surely not a word : 1.4ms
  First word        : 1.8ms
  Last word         : 96.3ms
  Almost a word     : 116.6ms

Union of 100000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 1227.1ms
  Almost a word     : 1404.1ms

So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b' pattern has:

  • O(1) best case
  • O(n/2) average case, which is still O(n)
  • O(n) worst case

These results are consistent with a simple loop search.

A much faster alternative to a regex union is to create the regex pattern from a trie.


Answer 2

TLDR

Use this method if you want the fastest regex-based solution. For a dataset similar to the OP’s, it’s approximately 1000 times faster than the accepted answer.

If you don’t care about regex, use this set-based version, which is 2000 times faster than a regex union.

Optimized Regex with Trie

A simple Regex union approach becomes slow with many banned words, because the regex engine doesn’t do a very good job of optimizing the pattern.

It’s possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex aren’t really human-readable, but they do allow for very fast lookup and match.

Example

['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']

The list is converted to a trie:

{
    'f': {
        'o': {
            'o': {
                'x': {
                    'a': {
                        'r': {
                            '': 1
                        }
                    }
                },
                'b': {
                    'a': {
                        'r': {
                            '': 1
                        },
                        'h': {
                            '': 1
                        }
                    }
                },
                'z': {
                    'a': {
                        '': 1,
                        'p': {
                            '': 1
                        }
                    }
                }
            }
        }
    }
}

And then to this regex pattern:

r"\bfoo(?:ba[hr]|xar|zap?)\b"

The huge advantage is that to test if zoo matches, the regex engine only needs to compare the first character (it doesn’t match), instead of trying the 5 words. The preprocessing is overkill for 5 words, but it shows promising results for many thousand words.
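A quick sanity check of that short-circuiting (using the pattern above):

import re

union = re.compile(r"\bfoo(?:ba[hr]|xar|zap?)\b")
print(bool(union.search("zoo")))     # False: 'z' != 'f', the engine bails out early
print(bool(union.search("foobah")))  # True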

Note that (?:) non-capturing groups are used because:

  • foobar|baz would match foobar or baz, but not foobaz
  • foo(bar|baz) would save unneeded information to a capturing group

Code

Here’s a slightly modified gist, which we can use as a trie.py library:

import re


class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except TypeError:
                    # _pattern() returned None: this subtree is just a word end,
                    # so the character belongs in a character class instead
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

Test

Here’s a small test (the same setup as the regex union test above):

# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
    random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))

It outputs:

TrieRegex of 10 words
  Surely not a word : 0.3ms
  First word : 0.4ms
  Last word : 0.5ms
  Almost a word : 0.5ms

TrieRegex of 100 words
  Surely not a word : 0.3ms
  First word : 0.5ms
  Last word : 0.9ms
  Almost a word : 0.6ms

TrieRegex of 1000 words
  Surely not a word : 0.3ms
  First word : 0.7ms
  Last word : 0.9ms
  Almost a word : 1.1ms

TrieRegex of 10000 words
  Surely not a word : 0.1ms
  First word : 1.0ms
  Last word : 1.2ms
  Almost a word : 1.2ms

TrieRegex of 100000 words
  Surely not a word : 0.3ms
  First word : 1.2ms
  Last word : 0.9ms
  Almost a word : 1.6ms

For info, the regex begins like this:

(?:a(?:(?:\'s|a(?:\'s|chen|liyah(?:\'s)?|r(?:dvark(?:(?:\'s|s))?|on))|b(?:\'s|a(?:c(?:us(?:(?:\'s|es))?|[ik])|ft|lone(?:(?:\'s|s))?|ndon(?:(?:ed|ing|ment(?:\'s)?|s))?|s(?:e(?:(?:ment(?:\'s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\'s)?|[ds]))?|ing|toir(?:(?:\'s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:\'s|es))?|y(?:(?:\'s|s))?)|ot(?:(?:\'s|t(?:\'s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|y(?:\'s)?|\é(?:(?:\'s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|om(?:en(?:(?:\'s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:\'s|s))?)|or(?:(?:\'s|s))?|s))?|l(?:\'s)?))|e(?:(?:\'s|am|l(?:(?:\'s|ard|son(?:\'s)?))?|r(?:deen(?:\'s)?|nathy(?:\'s)?|ra(?:nt|tion(?:(?:\'s|s))?))|t(?:(?:t(?:e(?:r(?:(?:\'s|s))?|d)|ing|or(?:(?:\'s|s))?)|s))?|yance(?:\'s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:\'s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:\'s)?)|gail|l(?:ene|it(?:ies|y(?:\'s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:\'s|s))?|e[ds]?|ing))|l(?:a(?:tive(?:(?:\'s|s))?|ze)|e(?:(?:st|r))?|oom|ution(?:(?:\'s|s))?|y)|m\'s|n(?:e(?:gat(?:e[ds]?|i(?:ng|on(?:\'s)?))|r(?:\'s)?)|ormal(?:(?:it(?:ies|y(?:\'s)?)|ly))?)|o(?:ard|de(?:(?:\'s|s))?|li(?:sh(?:(?:e[ds]|ing))?|tion(?:(?:\'s|ist(?:(?:\'s|s))?))?)|mina(?:bl[ey]|t(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|r(?:igin(?:al(?:(?:\'s|s))?|e(?:(?:\'s|s))?)|t(?:(?:ed|i(?:ng|on(?:(?:\'s|ist(?:(?:\'s|s))?|s))?|ve)|s))?)|u(?:nd(?:(?:ed|ing|s))?|t)|ve(?:(?:\'s|board))?)|r(?:a(?:cadabra(?:\'s)?|d(?:e[ds]?|ing)|ham(?:\'s)?|m(?:(?:\'s|s))?|si(?:on(?:(?:\'s|s))?|ve(?:(?:\'s|ly|ness(?:\'s)?|s))?))|east|idg(?:e(?:(?:ment(?:(?:\'s|s))?|[ds]))?|ing|ment(?:(?:\'s|s))?)|o(?:ad|gat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|upt(?:(?:e(?:st|r)|ly|ness(?:\'s)?))?)|s(?:alom|c(?:ess(?:(?:\'s|e[ds]|ing))?|issa(?:(?:\'s|[es]))?|ond(?:(?:ed|ing|s))?)|en(?:ce(?:(?:\'s|s))?|t(?:(?:e(?:e(?:(?:\'s|ism(?:\'s)?|s))?|d)|ing|ly|s))?)|inth(?:(?:\'s|e(?:\'s)?))?|o(?:l(?:ut(?:e(?:(?:\'s|ly|st?))?|i(?:on(?:\'s)?|sm(?:\'s)?))|v(?:e[ds]?|ing))|r(?:b(?:(?:e(?:n(?:cy(?:\'s)?|t(?:(?:\'s|s))?)|d)|ing|s))?|pti…

It’s really unreadable, but for a list of 100000 banned words, this Trie regex is 1000 times faster than a simple regex union!

Here’s a diagram of the complete trie, exported with trie-python-graphviz and graphviz twopi.


Answer 3

One thing you might want to try is pre-processing the sentences to encode the word boundaries. Basically turn each sentence into a list of words by splitting on word boundaries.

This should be faster, because to process a sentence, you just have to step through each of the words and check if it’s a match.

Currently the regex search has to go through the entire string again each time, looking for word boundaries, and then “discard” the result of this work before the next pass.
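A minimal sketch of that idea (assuming a set of lowercase banned words; the capturing group in re.split keeps the separators so the sentence can be rebuilt):

import re

banned = {"hello", "world"}           # hypothetical banned words
splitter = re.compile(r'(\W+)')       # capture the separators too

def clean(sentence):
    parts = splitter.split(sentence)  # words and separators, interleaved
    return ''.join('' if p.lower() in banned else p for p in parts)

print(clean("hello there, world!"))   # -> ' there, !'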


Answer 4

Well, here’s a quick and easy solution, with a test set.

Winning strategy:

re.sub(r"\w+", repl, sentence) searches for words.

"repl" can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search for and their replacements.

This is the simplest and fastest solution (see function replace4 in example code below).

Second best

The idea is to split the sentences into words, using re.split, while conserving the separators to reconstruct the sentences later. Then, replacements are done with a simple dict lookup.

(see function replace3 in example code below).

Timings for example functions:

replace1: 0.62 sentences/s
replace2: 7.43 sentences/s
replace3: 48498.03 sentences/s
replace4: 61374.97 sentences/s (...and 240,000/s with PyPy)

…and code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import time, random, re

def replace1( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns:
            sentence = re.sub( "\\b"+search+"\\b", repl, sentence )

def replace2( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns_comp:
            sentence = re.sub( search, repl, sentence )

def replace3( sentences ):
    pd = patterns_dict.get
    for n, sentence in enumerate( sentences ):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)

        # and... done.
        sentence = "".join( pd(w,w) for w in words )

        #~ print( n, sentence )

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w,w)

    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)



# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]


def test( func, num ):
    t = time.time()
    func( test_sentences[:num] )
    print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))

print( "Sentences", len(test_sentences) )
print( "Words    ", len(test_words) )

test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )

Edit: You can also ignore case when checking, if you pass a lowercased list of sentences and edit repl:

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w.lower(),w)

Answer 5

Perhaps Python is not the right tool here. Here is one approach with the Unix toolchain:

sed G file         |
tr ' ' '\n'        |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'

assuming your blacklist file is preprocessed with the word boundaries added. The steps are: convert the file to double-spaced, split each sentence into one word per line, mass-delete the blacklisted words from the file, and merge the lines back.

This should run at least an order of magnitude faster.

For preprocessing the blacklist file from words (one word per line):

sed 's/.*/\\b&\\b/' words > blacklist

Answer 6

How about this:

#!/usr/bin/env python3

from __future__ import unicode_literals, print_function
import re
import time
import io

def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            yield next(words_iter) # yield the word separator

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        while True:
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            yield sentence[last_word_boundary:current_boundary] # yield the separators
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            word = sentence[last_word_boundary:current_boundary]
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b' .')
print('time:', time.time() - start)

These solutions split on word boundaries and look up each word in a set. They should be faster than re.sub with word alternates (Liteye’s solution), as these solutions are O(n), where n is the size of the input, thanks to the amortized O(1) set lookup, while using regex alternates would cause the regex engine to have to check for word matches at every character rather than just at word boundaries. My solutions take extra care to preserve the whitespace that was used in the original text (i.e. they don’t compress whitespace and they preserve tabs, newlines, and other whitespace characters), but if you decide that you don’t care about it, it should be fairly straightforward to remove them from the output.

I tested on corpus.txt, which is a concatenation of multiple eBooks downloaded from the Gutenberg Project, and banned_words.txt is 20000 words randomly picked from Ubuntu’s wordlist (/usr/share/dict/american-english). It takes around 30 seconds to process 862462 sentences (and half of that on PyPy). I’ve defined sentences as anything separated by “. “.

$ # replace_sentences_1()
$ python3 filter_words.py 
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py 
number of sentences: 862462
time: 15.9370770454

$ # replace_sentences_2()
$ python3 filter_words.py 
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py 
number of sentences: 862462
time: 13.1190629005

PyPy particularly benefits from the second approach, while CPython fared better on the first. The above code should work on both Python 2 and 3.


Answer 7

Practical approach

The solution described below uses a lot of memory to store all the text in a single string and to reduce the complexity. If RAM is an issue, think twice before using it.

With join/split tricks you can avoid loops altogether, which should speed up the algorithm; a combined sketch follows the steps.

  • Concatenate the sentences with a special delimiter that is not contained in any sentence:
    merged_sentences = ' * '.join(sentences)

  • Compile a single regex for all the words you need to remove from the sentences, using the | "or" regex statement:
    regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag

  • Substitute the words with the compiled regex and split the result on the special delimiter back into separate sentences:
    clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')
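Putting the three steps above together (a sketch; it assumes ' * ' never occurs inside a sentence):

import re

sentences = ["hello there", "the world is big"]
words = ["hello", "world"]

merged_sentences = ' * '.join(sentences)
regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I)
clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')
print(clean_sentences)  # [' there', 'the  is big']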
    

Performance

"".join complexity is O(n). This is pretty intuitive, but anyway there is a shortened quotation from the CPython source:

for (i = 0; i < seqlen; i++) {
    [...]
    sz += PyUnicode_GET_LENGTH(item);

Therefore with join/split you have O(words) + 2*O(sentences), which is still linear complexity, vs 2*O(N²) with the initial approach.

By the way, don't use multithreading. The GIL will block each operation, because your task is strictly CPU-bound, so the GIL has no chance to be released; each thread will just send ticks concurrently, which causes extra work and can even drag the operation out indefinitely.


Answer 8

Concatenate all your sentences into one document. Use any implementation of the Aho-Corasick algorithm (here's one) to locate all your "bad" words. Traverse the file, replacing each bad word and updating the offsets of the found words that follow, etc.
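A sketch of this approach with the third-party pyahocorasick package (pip install pyahocorasick); the word list is hypothetical and, unlike the regex answers, no word-boundary check is done here:

import ahocorasick

banned = ["hello", "world"]
automaton = ahocorasick.Automaton()
for word in banned:
    automaton.add_word(word, word)       # store the word itself as the value
automaton.make_automaton()

text = "hello big world"
out, last = [], 0
for end, word in automaton.iter(text):   # yields (inclusive end index, value)
    start = end - len(word) + 1
    if start >= last:                    # skip matches overlapping a previous one
        out.append(text[last:start])
        last = end + 1
out.append(text[last:])
print("".join(out))                      # -> ' big '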


Conditional replace in Pandas

Question: Conditional replace in Pandas

I have a DataFrame, and I want to replace the values in a particular column that exceed a value with zero. I had thought this was a way of achieving this:

df[df.my_channel > 20000].my_channel = 0

If I copy the channel into a new data frame it's simple:

df2 = df.my_channel

df2[df2 > 20000] = 0

This does exactly what I want, but seems not to work with the channel as part of the original DataFrame.


Answer 0

The .ix indexer works okay for pandas versions prior to 0.20.0, but since pandas 0.20.0 the .ix indexer is deprecated, so you should avoid using it. Instead, you can use the .loc or .iloc indexers. You can solve this problem with:

mask = df.my_channel > 20000
column_name = 'my_channel'
df.loc[mask, column_name] = 0

Or, in one line,

df.loc[df.my_channel > 20000, 'my_channel'] = 0

mask helps you select the rows in which df.my_channel > 20000 is True, while df.loc[mask, column_name] = 0 sets the value 0 in the selected rows of the column whose name is column_name.

Update: In this case, you should use loc, because if you use iloc you will get a NotImplementedError telling you that iLocation-based boolean indexing on an integer type is not available.


Answer 1

Try

df.loc[df.my_channel > 20000, 'my_channel'] = 0

Note: Since v0.20.0, ix has been deprecated in favour of loc / iloc.


Answer 2

The np.where function works as follows:

df['X'] = np.where(df['Y']>=50, 'yes', 'no')

In your case you would want:

import numpy as np
df['my_channel'] = np.where(df.my_channel > 20000, 0, df.my_channel)

Answer 3

The reason your original dataframe does not update is that chained indexing may cause you to modify a copy rather than a view of your dataframe. The docs give this advice:

When setting values in a pandas object, care must be taken to avoid what is called chained indexing.

You have a few alternatives:

loc + Boolean indexing

loc may be used for setting values and supports Boolean masks:

df.loc[df['my_channel'] > 20000, 'my_channel'] = 0

mask + Boolean indexing

You can assign to your series:

df['my_channel'] = df['my_channel'].mask(df['my_channel'] > 20000, 0)

Or you can update your series in place:

df['my_channel'].mask(df['my_channel'] > 20000, 0, inplace=True)

np.where + Boolean indexing

You can use NumPy by assigning your original series when your condition is not satisfied; however, the first two solutions are cleaner since they explicitly change only specified values.

df['my_channel'] = np.where(df['my_channel'] > 20000, 0, df['my_channel'])

Answer 4

I would use a lambda function on a Series of a DataFrame like this:

f = lambda x: 0 if x>100 else 1
df['my_column'] = df['my_column'].map(f)

I do not assert that this is an efficient way, but it works fine.


Answer 5

Try this:

df.my_channel = df.my_channel.where(df.my_channel <= 20000, other=0)

or

df.my_channel = df.my_channel.mask(df.my_channel > 20000, other=0)


How to replace text in a column of a Pandas dataframe?

Question: How to replace text in a column of a Pandas dataframe?

I have a column in my dataframe like this:

range
"(2,30)"
"(50,290)"
"(400,1000)"
...

and I want to replace the comma , with a dash -. I'm currently using this method, but nothing changes.

org_info_exc['range'].replace(',', '-', inplace=True)

Can anybody help?


Answer 0

Use the vectorised str method replace:

In [30]:

df['range'] = df['range'].str.replace(',','-')
df
Out[30]:
      range
0    (2-30)
1  (50-290)

EDIT

So if we look at what you tried and why it didn't work:

df['range'].replace(',','-',inplace=True)

from the docs we see this description:

str or regex: str: string exactly matching to_replace will be replaced with value

So because the str values do not match, no replacement occurs; compare with the following:

In [43]:

df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
Out[43]:
0    (2,30)
1         -
Name: range, dtype: object

Here we get an exact match on the second row and the replacement occurs.


Answer 1

For anyone else arriving here from a Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column): Pandas has a built-in replace method available on a dataframe object.

df.replace(',', '-', regex=True)

Source: Docs


Answer 2

Replace all spaces with underscores in the column names

data.columns = data.columns.str.replace(' ', '_', regex=True)

Answer 3

In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:

import re
chars_to_remove = ['.', '-', '(', ')', '']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'

df['string_col'].str.replace(regular_expression, '', regex=True)

Replace and overwrite instead of appending

Question: Replace and overwrite instead of appending

I have the following code:

import re
#open the xml file for reading:
file = open('path/test.xml','r+')
#convert to string:
data = file.read()
file.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",data))
file.close()

where I'd like to replace the old content that's in the file with the new content. However, when I execute my code, the file "test.xml" is appended to, i.e. I have the old content followed by the new "replaced" content. What can I do in order to delete the old stuff and only keep the new?


Answer 0

You need to seek to the beginning of the file before writing, and then use file.truncate() if you want to do an in-place replace:

import re

myfile = "path/test.xml"

with open(myfile, "r+") as f:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
    f.truncate()

The other way is to read the file, then open it again with open(myfile, 'w'):

with open(myfile, "r") as f:
    data = f.read()

with open(myfile, "w") as f:
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))

Neither truncate nor open(..., 'w') will change the inode number of the file (I tested twice, once with Ubuntu 12.04 NFS and once with ext4).

By the way, this is not really related to Python. The interpreter calls the corresponding low-level API. The method truncate() works the same in the C programming language: see http://man7.org/linux/man-pages/man2/truncate.2.html
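To check the inode claim yourself, a small sketch (assuming the file from the question exists):

import os

path = "path/test.xml"
ino_before = os.stat(path).st_ino
with open(path, "w") as f:  # rewrite the file in 'w' mode
    f.write("new content")
print(os.stat(path).st_ino == ino_before)  # True on typical filesystems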


Answer 1

file = 'path/test.xml'
with open(file, 'w') as filetowrite:
    filetowrite.write('new content')

Open the file in 'w' mode; you will be able to replace its current text and save the file with the new contents.


Answer 2

Using truncate(), the solution could be

import re
#open the xml file for reading:
with open('path/test.xml','r+') as f:
    #convert to string:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",data))
    f.truncate()

Answer 3

import os  # must import this library
if os.path.exists('TwitterDB.csv'):
    os.remove('TwitterDB.csv')  # this deletes the file
else:
    print("The file does not exist")  # add this to prevent errors

I had a similar problem, and instead of overwriting my existing file using the different 'modes', I just deleted the file before using it again, so that it would be as if I was appending to a new file on each run of my code.


Answer 4

    See How to Replace String in File, which works in a simple way and is an answer that uses replace:

    fin = open("data.txt", "rt")
    fout = open("out.txt", "wt")
    
    for line in fin:
        fout.write(line.replace('pyton', 'python'))
    
    fin.close()
    fout.close()
    

    回答 5

    使用python3 pathlib库:

    import re
    from pathlib import Path
    import shutil
    
    shutil.copy2("/tmp/test.xml", "/tmp/test.xml.bak") # create backup
    filepath = Path("/tmp/test.xml")
    content = filepath.read_text()
    filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
    

    使用不同方法进行备份的类似方法:

    from pathlib import Path
    
    filepath = Path("/tmp/test.xml")
    filepath.rename(filepath.with_suffix('.bak')) # different approach to backups
    content = filepath.read_text()
    filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
    

    Using python3 pathlib library:

    import re
    from pathlib import Path
    import shutil
    
    shutil.copy2("/tmp/test.xml", "/tmp/test.xml.bak") # create backup
    filepath = Path("/tmp/test.xml")
    content = filepath.read_text()
    filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
    

    Similar method using different approach to backups:

    from pathlib import Path
    
    filepath = Path("/tmp/test.xml")
    filepath.rename(filepath.with_suffix('.bak')) # different approach to backups
    content = filepath.read_text()
    filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
    

    如何在python中删除特定字符之后的所有字符?

    问题:如何在python中删除特定字符之后的所有字符?

    我有一个字符串。如何删除某个字符之后的所有文本?(在这种情况下是 ...)
    ... 之后的文本会变化,所以这就是我想删除某个字符之后所有字符的原因。

    I have a string. How do I remove all text after a certain character? (In this case ...)
    The text after ... will change, so that's why I want to remove all characters after a certain one.


    回答 0

    最多在分隔器上分割一次,然后取下第一块:

    sep = '...'
    rest = text.split(sep, 1)[0]
    

    您没有说如果不使用分隔符该怎么办。在这种情况下,此方法和Alex的解决方案都将返回整个字符串。

    Split on your separator at most once, and take the first piece:

    sep = '...'
    stripped = text.split(sep, 1)[0]
    

    You didn’t say what should happen if the separator isn’t present. Both this and Alex’s solution will return the entire string in that case.


    回答 1

    假设分隔符为“ …”,但它可以是任何字符串。

    text = 'some string... this part will be removed.'
    head, sep, tail = text.partition('...')
    
    >>> print head
    some string
    

    如果找不到分隔符,head将包含所有原始字符串。

    分区功能是在Python 2.5中添加的。

    partition(…) S.partition(sep) -> (head, sep, tail)

    Searches for the separator sep in S, and returns the part before it,
    the separator itself, and the part after it.  If the separator is not
    found, returns S and two empty strings.
    

    Assuming your separator is ‘…’, but it can be any string.

    text = 'some string... this part will be removed.'
    head, sep, tail = text.partition('...')
    
    >>> print head
    some string
    

    If the separator is not found, head will contain all of the original string.

    The partition function was added in Python 2.5.

    partition(…) S.partition(sep) -> (head, sep, tail)

    Searches for the separator sep in S, and returns the part before it,
    the separator itself, and the part after it.  If the separator is not
    found, returns S and two empty strings.
    
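    If you want to keep the separator itself, a small sketch building on the same call: concatenating head and sep also degrades gracefully, since partition returns an empty sep when nothing is found.

    text = 'some string... this part will be removed.'
    head, sep, tail = text.partition('...')
    print(head + sep)   # 'some string...'
    # When '...' is absent, sep is '' and head is the whole string.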

    回答 2

    如果要删除字符串中最后一次出现分隔符之后的所有内容,我会发现这很有效:

    <separator>.join(string_to_split.split(<separator>)[:-1])

    例如,如果 string_to_split 是类似 root/location/child/too_far.exe 的路径,而你只需要文件夹路径,可以使用 "/".join(string_to_split.split("/")[:-1]) 进行拆分,得到 root/location/child。

    If you want to remove everything after the last occurrence of separator in a string I find this works well:

    <separator>.join(string_to_split.split(<separator>)[:-1])

    For example, if string_to_split is a path like root/location/child/too_far.exe and you only want the folder path, you can split by "/".join(string_to_split.split("/")[:-1]) and you’ll get root/location/child
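    The same result can be had with rsplit and a maxsplit of 1, which avoids splitting and rejoining all the earlier pieces; a minimal sketch using the same example path:

    path = "root/location/child/too_far.exe"
    print(path.rsplit("/", 1)[0])                  # root/location/child
    # If the separator never occurs, rsplit returns [path], so [0] is the whole string:
    print("no_separator_here".rsplit("/", 1)[0])   # no_separator_here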


    回答 3

    没有RE(我想是您想要的):

    def remafterellipsis(text):
      where_ellipsis = text.find('...')
      if where_ellipsis == -1:
        return text
      return text[:where_ellipsis + 3]

    或者,使用RE:

    import re
    
    def remwithre(text, there=re.compile(re.escape('...')+'.*')):
      return there.sub('', text)

    Without a RE (which I assume is what you want):

    def remafterellipsis(text):
      where_ellipsis = text.find('...')
      if where_ellipsis == -1:
        return text
      return text[:where_ellipsis + 3]
    

    or, with a RE:

    import re
    
    def remwithre(text, there=re.compile(re.escape('...')+'.*')):
      return there.sub('', text)
    

    回答 4

    find 方法会返回字符在字符串中的位置。然后,如果要删除该字符之后的所有内容,请执行以下操作:

    mystring = "123⋯567"
    mystring[ 0 : mystring.index("⋯")]
    
    >> '123'

    如果要保留字符,请在字符位置加1。

    The method find will return the character position in a string. Then, if you want remove every thing from the character, do this:

    mystring = "123⋯567"
    mystring[ 0 : mystring.index("⋯")]
    
    >> '123'
    

    If you want to keep the character, add 1 to the character position.
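    Note that .index() raises ValueError when the character is missing, while .find() returns -1; a minimal sketch guarding against that case:

    mystring = "123⋯567"
    pos = mystring.find("⋯")
    result = mystring if pos == -1 else mystring[:pos]
    print(result)  # '123'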


    回答 5

    import re
    test = "This is a test...we should not be able to see this"
    res = re.sub(r'\.\.\..*',"",test)
    print(res)

    输出:“这是一个测试”

    import re
    test = "This is a test...we should not be able to see this"
    res = re.sub(r'\.\.\..*',"",test)
    print(res)
    

    Output: “This is a test”
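    One caveat worth knowing: by default . does not match newlines, so on multi-line input only the rest of the current line is removed. A small sketch showing the re.DOTALL flag for the multi-line case:

    import re
    
    multiline = "keep this...drop this\nand this too"
    print(re.sub(r'\.\.\..*', "", multiline))                    # 'keep this\nand this too'
    print(re.sub(r'\.\.\..*', "", multiline, flags=re.DOTALL))   # 'keep this'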


    回答 6

    从文件中:

    import re
    sep = '...'
    
    with open("requirements.txt") as file_in:
        lines = []
        for line in file_in:
            res = line.split(sep, 1)[0]
            print(res)

    From a file:

    import re
    sep = '...'
    
    with open("requirements.txt") as file_in:
        lines = []
        for line in file_in:
            res = line.split(sep, 1)[0]
            print(res)
    

    回答 7

    使用re的另一种简单方法是

    import re
    
    text = 'some string... this part will be removed.'
    
    text = re.search(r'(\A.*)\.\.\..+', text, re.DOTALL|re.IGNORECASE).group(1)
    
    # text = 'some string'

    another easy way using re will be

    import re
    
    text = 'some string... this part will be removed.'
    
    text = re.search(r'(\A.*)\.\.\..+', text, re.DOTALL|re.IGNORECASE).group(1)
    
    # text = 'some string'
    

    删除字符串中的字符列表

    问题:删除字符串中的字符列表

    我想在python中删除字符串中的字符:

    string.replace(',', '').replace("!", '').replace(":", '').replace(";", '')...

    但是我必须删除许多字符。我想到了一个清单

    list = [',', '!', '.', ';'...]

    但是,如何使用这个 list 来替换 string 中的字符?

    I want to remove characters in a string in python:

    string.replace(',', '').replace("!", '').replace(":", '').replace(";", '')...
    

    But I have many characters I have to remove. I thought about a list

    list = [',', '!', '.', ';'...]
    

    But how can I use the list to replace the characters in the string?


    回答 0

    如果您使用的是python2,而您的输入是字符串(不是unicodes),则绝对最佳的方法是str.translate

    >>> chars_to_remove = ['.', '!', '?']
    >>> subj = 'A.B!C?'
    >>> subj.translate(None, ''.join(chars_to_remove))
    'ABC'

    否则,可以考虑以下选项:

    A. 逐字符迭代主题字符串,省略不需要的字符,然后 join 结果列表:

    >>> sc = set(chars_to_remove)
    >>> ''.join([c for c in subj if c not in sc])
    'ABC'

    (请注意,生成器版本 ''.join(c for c ...)效率较低)。

    B. 动态创建一个正则表达式,并用空字符串进行 re.sub:

    >>> import re
    >>> rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    >>> re.sub(rx, '', subj)
    'ABC'

    (re.escape 确保像 ^ 或 ] 这样的字符不会破坏正则表达式)。

    C. 使用 translate 的映射变体:

    >>> chars_to_remove = [u'δ', u'Γ', u'ж']
    >>> subj = u'AжBδCΓ'
    >>> dd = {ord(c):None for c in chars_to_remove}
    >>> subj.translate(dd)
    u'ABC'

    完整的测试代码和计时:

    #coding=utf8
    
    import re
    
    def remove_chars_iter(subj, chars):
        sc = set(chars)
        return ''.join([c for c in subj if c not in sc])
    
    def remove_chars_re(subj, chars):
        return re.sub('[' + re.escape(''.join(chars)) + ']', '', subj)
    
    def remove_chars_re_unicode(subj, chars):
        return re.sub(u'(?u)[' + re.escape(''.join(chars)) + ']', '', subj)
    
    def remove_chars_translate_bytes(subj, chars):
        return subj.translate(None, ''.join(chars))
    
    def remove_chars_translate_unicode(subj, chars):
        d = {ord(c):None for c in chars}
        return subj.translate(d)
    
    import timeit, sys
    
    def profile(f):
        assert f(subj, chars_to_remove) == test
        t = timeit.timeit(lambda: f(subj, chars_to_remove), number=1000)
        print ('{0:.3f} {1}'.format(t, f.__name__))
    
    print (sys.version)
    PYTHON2 = sys.version_info[0] == 2
    
    print ('\n"plain" string:\n')
    
    chars_to_remove = ['.', '!', '?']
    subj = 'A.B!C?' * 1000
    test = 'ABC' * 1000
    
    profile(remove_chars_iter)
    profile(remove_chars_re)
    
    if PYTHON2:
        profile(remove_chars_translate_bytes)
    else:
        profile(remove_chars_translate_unicode)
    
    print ('\nunicode string:\n')
    
    if PYTHON2:
        chars_to_remove = [u'δ', u'Γ', u'ж']
        subj = u'AжBδCΓ'
    else:
        chars_to_remove = ['δ', 'Γ', 'ж']
        subj = 'AжBδCΓ'
    
    subj = subj * 1000
    test = 'ABC' * 1000
    
    profile(remove_chars_iter)
    
    if PYTHON2:
        profile(remove_chars_re_unicode)
    else:
        profile(remove_chars_re)
    
    profile(remove_chars_translate_unicode)

    结果:

    2.7.5 (default, Mar  9 2014, 22:15:05) 
    [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)]
    
    "plain" string:
    
    0.637 remove_chars_iter
    0.649 remove_chars_re
    0.010 remove_chars_translate_bytes
    
    unicode string:
    
    0.866 remove_chars_iter
    0.680 remove_chars_re_unicode
    1.373 remove_chars_translate_unicode
    
    ---
    
    3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22) 
    [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
    
    "plain" string:
    
    0.512 remove_chars_iter
    0.574 remove_chars_re
    0.765 remove_chars_translate_unicode
    
    unicode string:
    
    0.817 remove_chars_iter
    0.686 remove_chars_re
    0.876 remove_chars_translate_unicode

    (作为附带说明,该数字remove_chars_translate_bytes可能为我们提供了一个线索,说明为什么该行业这么长时间不愿采用Unicode)。

    If you’re using python2 and your inputs are strings (not unicodes), the absolutely best method is str.translate:

    >>> chars_to_remove = ['.', '!', '?']
    >>> subj = 'A.B!C?'
    >>> subj.translate(None, ''.join(chars_to_remove))
    'ABC'
    

    Otherwise, there are following options to consider:

    A. Iterate the subject char by char, omit unwanted characters and join the resulting list:

    >>> sc = set(chars_to_remove)
    >>> ''.join([c for c in subj if c not in sc])
    'ABC'
    

    (Note that the generator version ''.join(c for c ...) will be less efficient).

    B. Create a regular expression on the fly and re.sub with an empty string:

    >>> import re
    >>> rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    >>> re.sub(rx, '', subj)
    'ABC'
    

    (re.escape ensures that characters like ^ or ] won’t break the regular expression).

    C. Use the mapping variant of translate:

    >>> chars_to_remove = [u'δ', u'Γ', u'ж']
    >>> subj = u'AжBδCΓ'
    >>> dd = {ord(c):None for c in chars_to_remove}
    >>> subj.translate(dd)
    u'ABC'
    

    Full testing code and timings:

    #coding=utf8
    
    import re
    
    def remove_chars_iter(subj, chars):
        sc = set(chars)
        return ''.join([c for c in subj if c not in sc])
    
    def remove_chars_re(subj, chars):
        return re.sub('[' + re.escape(''.join(chars)) + ']', '', subj)
    
    def remove_chars_re_unicode(subj, chars):
        return re.sub(u'(?u)[' + re.escape(''.join(chars)) + ']', '', subj)
    
    def remove_chars_translate_bytes(subj, chars):
        return subj.translate(None, ''.join(chars))
    
    def remove_chars_translate_unicode(subj, chars):
        d = {ord(c):None for c in chars}
        return subj.translate(d)
    
    import timeit, sys
    
    def profile(f):
        assert f(subj, chars_to_remove) == test
        t = timeit.timeit(lambda: f(subj, chars_to_remove), number=1000)
        print ('{0:.3f} {1}'.format(t, f.__name__))
    
    print (sys.version)
    PYTHON2 = sys.version_info[0] == 2
    
    print ('\n"plain" string:\n')
    
    chars_to_remove = ['.', '!', '?']
    subj = 'A.B!C?' * 1000
    test = 'ABC' * 1000
    
    profile(remove_chars_iter)
    profile(remove_chars_re)
    
    if PYTHON2:
        profile(remove_chars_translate_bytes)
    else:
        profile(remove_chars_translate_unicode)
    
    print ('\nunicode string:\n')
    
    if PYTHON2:
        chars_to_remove = [u'δ', u'Γ', u'ж']
        subj = u'AжBδCΓ'
    else:
        chars_to_remove = ['δ', 'Γ', 'ж']
        subj = 'AжBδCΓ'
    
    subj = subj * 1000
    test = 'ABC' * 1000
    
    profile(remove_chars_iter)
    
    if PYTHON2:
        profile(remove_chars_re_unicode)
    else:
        profile(remove_chars_re)
    
    profile(remove_chars_translate_unicode)
    

    Results:

    2.7.5 (default, Mar  9 2014, 22:15:05) 
    [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)]
    
    "plain" string:
    
    0.637 remove_chars_iter
    0.649 remove_chars_re
    0.010 remove_chars_translate_bytes
    
    unicode string:
    
    0.866 remove_chars_iter
    0.680 remove_chars_re_unicode
    1.373 remove_chars_translate_unicode
    
    ---
    
    3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22) 
    [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
    
    "plain" string:
    
    0.512 remove_chars_iter
    0.574 remove_chars_re
    0.765 remove_chars_translate_unicode
    
    unicode string:
    
    0.817 remove_chars_iter
    0.686 remove_chars_re
    0.876 remove_chars_translate_unicode
    

    (As a side note, the figure for remove_chars_translate_bytes might give us a clue why the industry was reluctant to adopt Unicode for such a long time).
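    On Python 3, the byte-style call shown for Python 2 has a close analogue: the three-argument form of str.maketrans, whose third argument lists the characters to map to None. A minimal sketch:

    chars_to_remove = ['.', '!', '?']
    subj = 'A.B!C?'
    table = str.maketrans('', '', ''.join(chars_to_remove))  # third argument: chars to delete
    print(subj.translate(table))  # ABC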


    回答 1

    您可以使用str.translate()

    s.translate(None, ",!.;")

    例:

    >>> s = "asjo,fdjk;djaso,oio!kod.kjods;dkps"
    >>> s.translate(None, ",!.;")
    'asjofdjkdjasooiokodkjodsdkps'

    You can use str.translate():

    s.translate(None, ",!.;")
    

    Example:

    >>> s = "asjo,fdjk;djaso,oio!kod.kjods;dkps"
    >>> s.translate(None, ",!.;")
    'asjofdjkdjasooiokodkjodsdkps'
    

    回答 2

    您可以使用翻译方法。

    s.translate(None, '!.;,')

    You can use the translate method.

    s.translate(None, '!.;,')
    

    回答 3

    ''.join(c for c in myString if not c in badTokens)
    ''.join(c for c in myString if not c in badTokens)
    

    回答 4

    如果您使用的是python3并且正在寻找translate解决方案-函数已更改,现在使用1个参数而不是2个参数。

    该参数是一个表(可以是字典),其中每个键是要查找的字符的Unicode序数(int),值是替换字符(可以是Unicode序数或将键映射到的字符串)。

    这是一个用法示例:

    >>> list = [',', '!', '.', ';']
    >>> s = "This is, my! str,ing."
    >>> s.translate({ord(x): '' for x in list})
    'This is my string'

    If you are using python3 and looking for the translate solution – the function was changed and now takes 1 parameter instead of 2.

    That parameter is a table (can be dictionary) where each key is the Unicode ordinal (int) of the character to find and the value is the replacement (can be either a Unicode ordinal or a string to map the key to).

    Here is a usage example:

    >>> list = [',', '!', '.', ';']
    >>> s = "This is, my! str,ing."
    >>> s.translate({ord(x): '' for x in list})
    'This is my string'
    

    回答 5

    使用正则表达式的另一种方法:

    ''.join(re.split(r'[.;!?,]', s))

    Another approach using regex:

    ''.join(re.split(r'[.;!?,]', s))
    

    回答 6

    为什么不进行简单循环?

    for i in replace_list:
        string = string.replace(i, '')

    另外,避免将列表命名为“列表”。它覆盖了内置函数list

    Why not a simple loop?

    for i in replace_list:
        string = string.replace(i, '')
    

    Also, avoid naming lists ‘list’. It overrides the built-in function list.


    回答 7

    你可以用这样的东西

    def replace_all(text, dic):
      for i, j in dic.iteritems():
        text = text.replace(i, j)
      return text

    这段代码不是我自己的,来自这里,是一篇很棒的文章,并深入探讨了这一点。

    you could use something like this

    def replace_all(text, dic):
      for i, j in dic.iteritems():
        text = text.replace(i, j)
      return text
    

    This code is not my own and comes from here; it's a great article that discusses doing this in depth
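    Since dict.iteritems() exists only on Python 2, a hedged Python 3 sketch of the same idea uses items():

    def replace_all(text, dic):
        # Python 3: items() instead of iteritems()
        for old, new in dic.items():
            text = text.replace(old, new)
        return text
    
    print(replace_all("a,b!c;", {",": "", "!": "", ";": ""}))  # abc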


    回答 8

    另一个有趣的话题:从字符串中去除 UTF-8 重音,将字符转换为标准的无重音字符:

    删除python unicode字符串中的重音符号的最佳方法是什么?

    来自主题的代码摘录:

    import unicodedata
    
    def remove_accents(input_str):
        nkfd_form = unicodedata.normalize('NFKD', input_str)
        return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

    Also an interesting topic on removal UTF-8 accent form a string converting char to their standard non-accentuated char:

    What is the best way to remove accents in a python unicode string?

    code extract from the topic:

    import unicodedata
    
    def remove_accents(input_str):
        nkfd_form = unicodedata.normalize('NFKD', input_str)
        return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])
    

    回答 9

    也许这是实现你所期望的更现代、更函数式的方法:

    >>> subj = 'A.B!C?'
    >>> list = set([',', '!', '.', ';', '?'])
    >>> filter(lambda x: x not in list, subj)
    'ABC'

    请注意,对于这个特定目的,这有点大材小用,但一旦你需要更复杂的条件,filter 就派上用场了

    Perhaps a more modern and functional way to achieve what you wish:

    >>> subj = 'A.B!C?'
    >>> list = set([',', '!', '.', ';', '?'])
    >>> filter(lambda x: x not in list, subj)
    'ABC'
    

    please note that for this particular purpose it's quite overkill, but once you need more complex conditions, filter comes in handy
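    On Python 3, filter returns an iterator rather than a string, so the result must be joined; a small sketch of the Python 3 form:

    subj = 'A.B!C?'
    unwanted = {',', '!', '.', ';', '?'}
    print(''.join(filter(lambda x: x not in unwanted, subj)))  # ABC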


    回答 10

    简单的方法

    import re
    s = 'this is string !    >><< (foo---> bar) @-tuna-#   sandwich-%-is-$-* good'
    
    # condense multiple empty spaces into 1 (note: 's' avoids shadowing the built-in str)
    s = ' '.join(s.split())
    
    # replace empty space with dash
    s = s.replace(" ", "-")
    
    # take out any char that matches the regex
    s = re.sub('[!@#$%^&*()_+<>]', '', s)

    输出:

    this-is-string--foo----bar--tuna---sandwich--is---good

    simple way,

    import re
    s = 'this is string !    >><< (foo---> bar) @-tuna-#   sandwich-%-is-$-* good'
    
    # condense multiple empty spaces into 1 (note: 's' avoids shadowing the built-in str)
    s = ' '.join(s.split())
    
    # replace empty space with dash
    s = s.replace(" ", "-")
    
    # take out any char that matches the regex
    s = re.sub('[!@#$%^&*()_+<>]', '', s)
    

    output:

    this-is-string--foo----bar--tuna---sandwich--is---good


    回答 11

    怎么样-一个衬垫。

    reduce(lambda x,y : x.replace(y,"") ,[',', '!', '.', ';'],";Test , ,  !Stri!ng ..")

    How about this – a one liner.

    reduce(lambda x,y : x.replace(y,"") ,[',', '!', '.', ';'],";Test , ,  !Stri!ng ..")
    

    回答 12

    我认为这很简单并且可以!

    list = [",",",","!",";",":"] #the list goes on.....
    
    theString = "dlkaj;lkdjf'adklfaj;lsd'fa'dfj;alkdjf" #is an example string;
    newString="" #the unwanted character free string
    for i in range(len(theString)):
        if theString[i] in list:
            newString += "" #concatenate an empty string.
        else:
            newString += theString[i]

    这是做到这一点的一种方法。但是,如果您厌倦了要保留要删除的字符列表,则实际上可以使用迭代的字符串的顺序号来完成。订单号是该字符的ascii值。0作为字符的ascii数为48,小写字母z的ascii数为122,因此:

    theString = "lkdsjf;alkd8a'asdjf;lkaheoialkdjf;ad"
    newString = ""
    for i in range(len(theString)):
         if ord(theString[i]) < 48 or ord(theString[i]) > 122: #ord() => ascii num.
             newString += ""
         else:
            newString += theString[i]

    i think this is simple enough and will do!

    list = [",",",","!",";",":"] #the list goes on.....
    
    theString = "dlkaj;lkdjf'adklfaj;lsd'fa'dfj;alkdjf" #is an example string;
    newString="" #the unwanted character free string
    for i in range(len(theString)):
        if theString[i] in list:
            newString += "" #concatenate an empty string.
        else:
            newString += theString[i]
    

    this is one way to do it. But if you are tired of keeping a list of characters that you want to remove, you can actually do it by using the order number of the strings you iterate through. the order number is the ascii value of that character. the ascii number for 0 as a char is 48 and the ascii number for lower case z is 122 so:

    theString = "lkdsjf;alkd8a'asdjf;lkaheoialkdjf;ad"
    newString = ""
    for i in range(len(theString)):
         if ord(theString[i]) < 48 or ord(theString[i]) > 122: #ord() => ascii num.
             newString += ""
         else:
            newString += theString[i]
    

    回答 13

    这些天我在钻研 Scheme,现在自认为擅长递归和 eval 了。哈哈哈。只分享一些新方法:

    首先,用 eval

    print eval('string%s' % (''.join(['.replace("%s","")'%i for i in replace_list])))

    第二,递归

    def repn(string,replace_list):
        if replace_list==[]:
            return string
        else:
            return repn(string.replace(replace_list.pop(),""),replace_list)
    
    print repn(string,replace_list)

    嘿,别踩。我只是想分享一些新想法。

    These days I am diving into scheme, and now I think am good at recursing and eval. HAHAHA. Just share some new ways:

    first ,eval it

    print eval('string%s' % (''.join(['.replace("%s","")'%i for i in replace_list])))
    

    second , recurse it

    def repn(string,replace_list):
        if replace_list==[]:
            return string
        else:
            return repn(string.replace(replace_list.pop(),""),replace_list)
    
    print repn(string,replace_list)
    

    Hey ,don’t downvote. I am just want to share some new idea.


    回答 14

    我为此考虑了一个解决方案。首先,我会把输入字符串变成列表,然后替换列表中的项,再通过 join 命令把列表作为字符串返回。代码可以像这样:

    def the_replacer(text):
        test = []    
        for m in range(len(text)):
            test.append(text[m])
            if test[m]==','\
            or test[m]=='!'\
            or test[m]=='.'\
            or test[m]=='\''\
            or test[m]==';':
        #....
            test[m]=''  # fix: the loop variable is m, not n
        return ''.join(test)

    这将从字符串中删除任何内容。您对此有何看法?

    I am thinking about a solution for this. First I would make the string input as a list. Then I would replace the items of list. Then through using join command, I will return list as a string. The code can be like this:

    def the_replacer(text):
        test = []    
        for m in range(len(text)):
            test.append(text[m])
            if test[m]==','\
            or test[m]=='!'\
            or test[m]=='.'\
            or test[m]=='\''\
            or test[m]==';':
        #....
            test[m]=''  # fix: the loop variable is m, not n
        return ''.join(test)
    

    This would remove anything from the string. What do you think about that?


    回答 15

    这是一种more_itertools方法:

    import more_itertools as mit
    
    
    s = "A.B!C?D_E@F#"
    blacklist = ".!?_@#"
    
    "".join(mit.flatten(mit.split_at(s, pred=lambda x: x in set(blacklist))))
    # 'ABCDEF'

    在这里,我们分割了在中找到的项目blacklist,将结果展平并加入字符串。

    Here is a more_itertools approach:

    import more_itertools as mit
    
    
    s = "A.B!C?D_E@F#"
    blacklist = ".!?_@#"
    
    "".join(mit.flatten(mit.split_at(s, pred=lambda x: x in set(blacklist))))
    # 'ABCDEF'
    

    Here we split upon items found in the blacklist, flatten the results and join the string.


    回答 16

    Python 3,单行列表理解实现。

    from string import ascii_lowercase # 'abcdefghijklmnopqrstuvwxyz'
    def remove_chars(input_string, removable):
      return ''.join([_ for _ in input_string if _ not in removable])
    
    print(remove_chars(input_string="Stack Overflow", removable=ascii_lowercase))
    >>> 'S O'

    Python 3, single line list comprehension implementation.

    from string import ascii_lowercase # 'abcdefghijklmnopqrstuvwxyz'
    def remove_chars(input_string, removable):
      return ''.join([_ for _ in input_string if _ not in removable])
    
    print(remove_chars(input_string="Stack Overflow", removable=ascii_lowercase))
    >>> 'S O'
    

    回答 17

    去掉 *%,&@!从下面的字符串:

    s = "this is my string,  and i will * remove * these ** %% "
    new_string = s.translate(s.maketrans('','','*%,&@!'))
    print(new_string)
    
    # output: this is my string  and i will  remove  these  

    Remove *%,&@! from below string:

    s = "this is my string,  and i will * remove * these ** %% "
    new_string = s.translate(s.maketrans('','','*%,&@!'))
    print(new_string)
    
    # output: this is my string  and i will  remove  these  
    

    如何搜索和替换文件中的文本?

    问题:如何搜索和替换文件中的文本?

    如何使用Python 3搜索和替换文件中的文本?

    这是我的代码:

    import os
    import sys
    import fileinput
    
    print ("Text to search for:")
    textToSearch = input( "> " )
    
    print ("Text to replace it with:")
    textToReplace = input( "> " )
    
    print ("File to perform Search-Replace on:")
    fileToSearch  = input( "> " )
    #fileToSearch = 'D:\dummy1.txt'
    
    tempFile = open( fileToSearch, 'r+' )
    
    for line in fileinput.input( fileToSearch ):
        if textToSearch in line :
            print('Match Found')
        else:
            print('Match Not Found!!')
        tempFile.write( line.replace( textToSearch, textToReplace ) )
    tempFile.close()
    
    
    input( '\n\n Press Enter to exit...' )

    输入文件:

    hi this is abcd hi this is abcd
    This is dummy text file.
    This is how search and replace works abcd

    当我在上面的输入文件中搜索并将 'ram' 替换为 'abcd' 时,一切正常。但反过来,即用 'ram' 替换 'abcd' 时,末尾会残留一些垃圾字符。

    用“ ram”代替“ abcd”

    hi this is ram hi this is ram
    This is dummy text file.
    This is how search and replace works rambcd

    How do I search and replace text in a file using Python 3?

    Here is my code:

    import os
    import sys
    import fileinput
    
    print ("Text to search for:")
    textToSearch = input( "> " )
    
    print ("Text to replace it with:")
    textToReplace = input( "> " )
    
    print ("File to perform Search-Replace on:")
    fileToSearch  = input( "> " )
    #fileToSearch = 'D:\dummy1.txt'
    
    tempFile = open( fileToSearch, 'r+' )
    
    for line in fileinput.input( fileToSearch ):
        if textToSearch in line :
            print('Match Found')
        else:
            print('Match Not Found!!')
        tempFile.write( line.replace( textToSearch, textToReplace ) )
    tempFile.close()
    
    
    input( '\n\n Press Enter to exit...' )
    

    Input file:

    hi this is abcd hi this is abcd
    This is dummy text file.
    This is how search and replace works abcd
    

    When I search and replace ‘ram’ by ‘abcd’ in above input file, it works as a charm. But when I do it vice-versa i.e. replacing ‘abcd’ by ‘ram’, some junk characters are left at the end.

    Replacing ‘abcd’ by ‘ram’

    hi this is ram hi this is ram
    This is dummy text file.
    This is how search and replace works rambcd
    

    回答 0

    fileinput 已经支持就地编辑。在这种情况下,它会将 stdout 重定向到文件:

    #!/usr/bin/env python3
    import fileinput
    
    with fileinput.FileInput(filename, inplace=True, backup='.bak') as file:
        for line in file:
            print(line.replace(text_to_search, replacement_text), end='')

    fileinput already supports inplace editing. It redirects stdout to the file in this case:

    #!/usr/bin/env python3
    import fileinput
    
    with fileinput.FileInput(filename, inplace=True, backup='.bak') as file:
        for line in file:
            print(line.replace(text_to_search, replacement_text), end='')
    
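    Since each print() appends its own newline and the lines read back already end with one, end='' avoids doubling them. The same pattern also works without the context manager, via the module-level helper; a minimal sketch (filename, text_to_search and replacement_text are assumed to be defined):

    import fileinput
    
    for line in fileinput.input(files=(filename,), inplace=True, backup='.bak'):
        print(line.replace(text_to_search, replacement_text), end='')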

    回答 1

    正如 michaelb958 指出的那样,您不能在原处用长度不同的数据进行替换,因为这会使其余部分错位。我不同意其他回答中从一个文件读取并写入另一个文件的建议。相反,我会把文件读入内存,修正数据,然后在单独的步骤中写回同一个文件。

    # Read in the file
    with open('file.txt', 'r') as file :
      filedata = file.read()
    
    # Replace the target string
    filedata = filedata.replace('ram', 'abcd')
    
    # Write the file out again
    with open('file.txt', 'w') as file:
      file.write(filedata)

    除非您要处理的文件太大而无法一次性加载到内存,或者您担心在第二步写入文件的过程中进程被中断而导致潜在的数据丢失。

    As pointed out by michaelb958, you cannot replace in place with data of a different length because this will put the rest of the sections out of place. I disagree with the other posters suggesting you read from one file and write to another. Instead, I would read the file into memory, fix the data up, and then write it out to the same file in a separate step.

    # Read in the file
    with open('file.txt', 'r') as file :
      filedata = file.read()
    
    # Replace the target string
    filedata = filedata.replace('ram', 'abcd')
    
    # Write the file out again
    with open('file.txt', 'w') as file:
      file.write(filedata)
    

    Unless you’ve got a massive file to work with which is too big to load into memory in one go, or you are concerned about potential data loss if the process is interrupted during the second step in which you write data to the file.
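    If that interruption risk matters, a common mitigation (a sketch, not part of the original answer) is to write to a temporary file in the same directory and atomically swap it in with os.replace:

    import os
    import tempfile
    
    def replace_in_file_atomic(path, old, new):
        with open(path) as f:
            data = f.read().replace(old, new)
        # Write to a temp file next to the target, then atomically swap it in.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
        try:
            with os.fdopen(fd, 'w') as f:
                f.write(data)
            os.replace(tmp, path)  # atomic on POSIX and Windows (Python 3.3+)
        except BaseException:
            os.remove(tmp)
            raise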


    回答 2

    正如杰克·艾德利(Jack Aidley)张贴的文章和JF Sebastian指出的那样,此代码不起作用:

     # Read in the file
    filedata = None
    with file = open('file.txt', 'r') :
      filedata = file.read()
    
    # Replace the target string
    filedata.replace('ram', 'abcd')
    
    # Write the file out again
    with file = open('file.txt', 'w') :
      file.write(filedata)

    但是此代码将起作用(我已经对其进行了测试):

    f = open(filein,'r')
    filedata = f.read()
    f.close()
    
    newdata = filedata.replace("old data","new data")
    
    f = open(fileout,'w')
    f.write(newdata)
    f.close()

    使用此方法,filein和fileout可以是同一文件,因为Python 3.3在打开进行写操作时会覆盖该文件。

    As Jack Aidley had posted and J.F. Sebastian pointed out, this code will not work:

     # Read in the file
    filedata = None
    with file = open('file.txt', 'r') :
      filedata = file.read()
    
    # Replace the target string
    filedata.replace('ram', 'abcd')
    
    # Write the file out again
    with file = open('file.txt', 'w') :
      file.write(filedata)
    

    But this code WILL work (I’ve tested it):

    f = open(filein,'r')
    filedata = f.read()
    f.close()
    
    newdata = filedata.replace("old data","new data")
    
    f = open(fileout,'w')
    f.write(newdata)
    f.close()
    

    Using this method, filein and fileout can be the same file, because Python 3.3 will overwrite the file upon opening for write.


    回答 3

    你可以这样替换

    f1 = open('file1.txt', 'r')
    f2 = open('file2.txt', 'w')
    for line in f1:
        f2.write(line.replace('old_text', 'new_text'))
    f1.close()
    f2.close()

    You can do the replacement like this

    f1 = open('file1.txt', 'r')
    f2 = open('file2.txt', 'w')
    for line in f1:
        f2.write(line.replace('old_text', 'new_text'))
    f1.close()
    f2.close()
    

    回答 4

    您也可以使用pathlib

    from pathlib2 import Path
    path = Path(file_to_search)
    text = path.read_text()
    text = text.replace(text_to_search, replacement_text)
    path.write_text(text)

    You can also use pathlib.

    from pathlib2 import Path
    path = Path(file_to_search)
    text = path.read_text()
    text = text.replace(text_to_search, replacement_text)
    path.write_text(text)
    

    回答 5

    使用单个with块,您可以搜索和替换文本:

    with open('file.txt','r+') as f:
        filedata = f.read()
        filedata = filedata.replace('abc','xyz')
        f.seek(0)      # rewind before overwriting; truncate alone does not move the cursor
        f.write(filedata)
        f.truncate()   # remove any leftover tail if the new text is shorter

    With a single with block, you can search and replace your text:

    with open('file.txt','r+') as f:
        filedata = f.read()
        filedata = filedata.replace('abc','xyz')
        f.seek(0)      # rewind before overwriting; truncate alone does not move the cursor
        f.write(filedata)
        f.truncate()   # remove any leftover tail if the new text is shorter
    

    回答 6

    您的问题源于对同一文件同时读取和写入。不要以写入方式打开 fileToSearch,而是打开一个真正的临时文件;在完成并关闭 tempFile 之后,使用 os.rename 将新文件移动到 fileToSearch 上。

    Your problem stems from reading from and writing to the same file. Rather than opening fileToSearch for writing, open an actual temporary file and then after you’re done and have closed tempFile, use os.rename to move the new file over fileToSearch.
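    A sketch of that suggestion (it reuses the question's variable names, which are assumed to be defined; this is not verbatim code from the answer):

    import os
    import tempfile
    
    with open(fileToSearch) as src, tempfile.NamedTemporaryFile(
            'w', dir=os.path.dirname(os.path.abspath(fileToSearch)),
            delete=False) as dst:
        for line in src:
            dst.write(line.replace(textToSearch, textToReplace))
    
    os.rename(dst.name, fileToSearch)  # use os.replace to also overwrite on Windows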


    回答 7

    (pip install python-util)

    from pyutil import filereplace
    
    filereplace("somefile.txt","abcd","ram")

    第二个参数(要被替换的内容,例如 "abcd",也可以是正则表达式)
    将替换所有出现之处

    (pip install python-util)

    from pyutil import filereplace
    
    filereplace("somefile.txt","abcd","ram")
    

    The second parameter (the thing to be replaced, e.g. “abcd” can also be a regex)
    Will replace all occurences


    回答 8

    我的变体,在整个文件上一次一个字。

    我将其读入内存。

    def replace_word(infile,old_word,new_word):
        if not os.path.isfile(infile):
            print ("Error on replace_word, not a regular file: "+infile)
            sys.exit(1)
    
        f1=open(infile,'r').read()
        f2=open(infile,'w')
        m=f1.replace(old_word,new_word)
        f2.write(m)

    My variant, one word at a time on the entire file.

    I read it into memory.

    def replace_word(infile,old_word,new_word):
        if not os.path.isfile(infile):
            print ("Error on replace_word, not a regular file: "+infile)
            sys.exit(1)
    
        f1=open(infile,'r').read()
        f2=open(infile,'w')
        m=f1.replace(old_word,new_word)
        f2.write(m)
    

    回答 9

    我已经做到了:

    #!/usr/bin/env python3
    
    import fileinput
    import os
    
    Dir = input ("Source directory: ")
    os.chdir(Dir)
    
    Filelist = os.listdir()
    print('File list: ',Filelist)
    
    NomeFile = input ("Insert file name: ")
    
    CarOr = input ("Text to search: ")
    
    CarNew = input ("New text: ")
    
    with fileinput.FileInput(NomeFile, inplace=True, backup='.bak') as file:
        for line in file:
            print(line.replace(CarOr, CarNew), end='')
    

    I have done this:

    #!/usr/bin/env python3
    
    import fileinput
    import os
    
    Dir = input ("Source directory: ")
    os.chdir(Dir)
    
    Filelist = os.listdir()
    print('File list: ',Filelist)
    
    NomeFile = input ("Insert file name: ")
    
    CarOr = input ("Text to search: ")
    
    CarNew = input ("New text: ")
    
    with fileinput.FileInput(NomeFile, inplace=True, backup='.bak') as file:
        for line in file:
            print(line.replace(CarOr, CarNew), end='')
    
    

    回答 10

    我稍微修改了 Jayram Singh 的帖子,把每个 '!' 字符替换成一个随每次出现而递增的数字。对于想修改每行出现多次的字符并进行迭代的人来说,这可能有帮助。希望能帮到某人。PS:我对编码还很陌生,如果我的帖子有任何不妥之处,敬请见谅,但这段代码对我有用。

    f1 = open('file1.txt', 'r')
    f2 = open('file2.txt', 'w')
    n = 1  
    
    # if word=='!'replace w/ [n] & increment n; else append same word to     
    # file2
    
    for line in f1:
        for word in line:
            if word == '!':
                f2.write(word.replace('!', f'[{n}]'))
                n += 1
            else:
                f2.write(word)
    f1.close()
    f2.close()

    I modified Jayram Singh’s post slightly in order to replace every instance of a ‘!’ character to a number which I wanted to increment with each instance. Thought it might be helpful to someone who wanted to modify a character that occurred more than once per line and wanted to iterate. Hope that helps someone. PS- I’m very new at coding so apologies if my post is inappropriate in any way, but this worked for me.

    f1 = open('file1.txt', 'r')
    f2 = open('file2.txt', 'w')
    n = 1  
    
    # if word=='!'replace w/ [n] & increment n; else append same word to     
    # file2
    
    for line in f1:
        for word in line:
            if word == '!':
                f2.write(word.replace('!', f'[{n}]'))
                n += 1
            else:
                f2.write(word)
    f1.close()
    f2.close()
    

    回答 11

    def word_replace(filename,old,new):
        c=0
        with open(filename,'r+',encoding ='utf-8') as f:
            a=f.read()
            b=a.split()
            for i in range(0,len(b)):
                if b[i]==old:
                    c=c+1
            old=old.center(len(old)+2)
            new=new.center(len(new)+2)
            d=a.replace(old,new,c)
            f.truncate(0)
            f.seek(0)
            f.write(d)
        print('All words have been replaced!!!')
    def word_replace(filename,old,new):
        c=0
        with open(filename,'r+',encoding ='utf-8') as f:
            a=f.read()
            b=a.split()
            for i in range(0,len(b)):
                if b[i]==old:
                    c=c+1
            old=old.center(len(old)+2)
            new=new.center(len(new)+2)
            d=a.replace(old,new,c)
            f.truncate(0)
            f.seek(0)
            f.write(d)
        print('All words have been replaced!!!')
    

    回答 12

    像这样:

    def find_and_replace(file, word, replacement):
      with open(file, 'r+') as f:
        text = f.read()
        f.seek(0)       # rewind: read() left the cursor at the end of the file
        f.write(text.replace(word, replacement))
        f.truncate()    # drop leftover bytes if the replacement is shorter

    Like so:

    def find_and_replace(file, word, replacement):
      with open(file, 'r+') as f:
        text = f.read()
        f.seek(0)       # rewind: read() left the cursor at the end of the file
        f.write(text.replace(word, replacement))
        f.truncate()    # drop leftover bytes if the replacement is shorter
    

    回答 13

    def findReplace(find, replace):
        import os
        src = os.path.join(os.getcwd(), os.pardir)
        for path, dirs, files in os.walk(os.path.abspath(src)):
            for name in files:
                if name.endswith('.py'):
                    filepath = os.path.join(path, name)
                    with open(filepath) as f:
                        s = f.read()
                    s = s.replace(find, replace)
                    with open(filepath, "w") as f:
                        f.write(s)
    def findReplace(find, replace):
        import os
        src = os.path.join(os.getcwd(), os.pardir)
        for path, dirs, files in os.walk(os.path.abspath(src)):
            for name in files:
                if name.endswith('.py'):
                    filepath = os.path.join(path, name)
                    with open(filepath) as f:
                        s = f.read()
                    s = s.replace(find, replace)
                    with open(filepath, "w") as f:
                        f.write(s)
    

    替换字符串中多个字符的最佳方法?

    问题:替换字符串中多个字符的最佳方法?

    我需要替换一些字符,如下所示:&\&#\#,…

    我编码如下,但是我想应该有一些更好的方法。有什么提示吗?

    strs = strs.replace('&', '\&')
    strs = strs.replace('#', '\#')
    ...

    I need to replace some characters as follows: &\&, #\#, …

    I coded as follows, but I guess there should be some better way. Any hints?

    strs = strs.replace('&', '\&')
    strs = strs.replace('#', '\#')
    ...
    

    回答 0

    替换两个字符

    我对当前答案中的所有方法以及一个额外的方法进行了计时。

    使用输入字符串abc&def#ghi并替换&-> \&和#-> \#,最快的方法是将替换链接在一起,如下所示:text.replace('&', '\&').replace('#', '\#')

    每个功能的时间:

    • a)1000000次循环,每循环3:1.47μs最佳
    • b)1000000次循环,每循环3:1.51μs最佳
    • c)100000个循环,每个循环最好为3:12.3μs
    • d)100000个循环,每个循环最好为3:12μs
    • e)100000个循环,每个循环最好为3:3.27μs
    • f)1000000次循环,最好为3:每个循环0.817μs
    • g)100000个循环,每个循环3:3.64μs最佳
    • h)1000000次循环,每循环最好3:0.927μs
    • i)1000000次循环,最佳3:每个循环0.814μs

    功能如下:

    def a(text):
        chars = "&#"
        for c in chars:
            text = text.replace(c, "\\" + c)
    
    
    def b(text):
        for ch in ['&','#']:
            if ch in text:
                text = text.replace(ch,"\\"+ch)
    
    
    import re
    def c(text):
        rx = re.compile('([&#])')
        text = rx.sub(r'\\\1', text)
    
    
    RX = re.compile('([&#])')
    def d(text):
        text = RX.sub(r'\\\1', text)
    
    
    def mk_esc(esc_chars):
        return lambda s: ''.join(['\\' + c if c in esc_chars else c for c in s])
    esc = mk_esc('&#')
    def e(text):
        esc(text)
    
    
    def f(text):
        text = text.replace('&', '\&').replace('#', '\#')
    
    
    def g(text):
        replacements = {"&": "\&", "#": "\#"}
        text = "".join([replacements.get(c, c) for c in text])
    
    
    def h(text):
        text = text.replace('&', r'\&')
        text = text.replace('#', r'\#')
    
    
    def i(text):
        text = text.replace('&', r'\&').replace('#', r'\#')

    像这样计时:

    python -mtimeit -s"import time_functions" "time_functions.a('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.b('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.c('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.d('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.e('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.f('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.g('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.h('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.i('abc&def#ghi')"

    替换17个字符

    下面是类似的代码,可完成相同的操作,但要转义的字符更多(\`*_{}[]()>#+-.!$):

    def a(text):
        chars = "\\`*_{}[]()>#+-.!$"
        for c in chars:
            text = text.replace(c, "\\" + c)
    
    
    def b(text):
        for ch in ['\\','`','*','_','{','}','[',']','(',')','>','#','+','-','.','!','$','\'']:
            if ch in text:
                text = text.replace(ch,"\\"+ch)
    
    
    import re
    def c(text):
        rx = re.compile(r'([\\`*_{}\[\]()>#+\-.!$])')
        text = rx.sub(r'\\\1', text)
    
    
    RX = re.compile(r'([\\`*_{}\[\]()>#+\-.!$])')
    def d(text):
        text = RX.sub(r'\\\1', text)
    
    
    def mk_esc(esc_chars):
        return lambda s: ''.join(['\\' + c if c in esc_chars else c for c in s])
    esc = mk_esc('\\`*_{}[]()>#+-.!$')
    def e(text):
        esc(text)
    
    
    def f(text):
        text = text.replace('\\', '\\\\').replace('`', '\`').replace('*', '\*').replace('_', '\_').replace('{', '\{').replace('}', '\}').replace('[', '\[').replace(']', '\]').replace('(', '\(').replace(')', '\)').replace('>', '\>').replace('#', '\#').replace('+', '\+').replace('-', '\-').replace('.', '\.').replace('!', '\!').replace('$', '\$')
    
    
    def g(text):
        replacements = {
            "\\": "\\\\",
            "`": "\`",
            "*": "\*",
            "_": "\_",
            "{": "\{",
            "}": "\}",
            "[": "\[",
            "]": "\]",
            "(": "\(",
            ")": "\)",
            ">": "\>",
            "#": "\#",
            "+": "\+",
            "-": "\-",
            ".": "\.",
            "!": "\!",
            "$": "\$",
        }
        text = "".join([replacements.get(c, c) for c in text])
    
    
    def h(text):
        text = text.replace('\\', r'\\')
        text = text.replace('`', r'\`')
        text = text.replace('*', r'\*')
        text = text.replace('_', r'\_')
        text = text.replace('{', r'\{')
        text = text.replace('}', r'\}')
        text = text.replace('[', r'\[')
        text = text.replace(']', r'\]')
        text = text.replace('(', r'\(')
        text = text.replace(')', r'\)')
        text = text.replace('>', r'\>')
        text = text.replace('#', r'\#')
        text = text.replace('+', r'\+')
        text = text.replace('-', r'\-')
        text = text.replace('.', r'\.')
        text = text.replace('!', r'\!')
        text = text.replace('$', r'\$')
    
    
    def i(text):
        text = text.replace('\\', r'\\').replace('`', r'\`').replace('*', r'\*').replace('_', r'\_').replace('{', r'\{').replace('}', r'\}').replace('[', r'\[').replace(']', r'\]').replace('(', r'\(').replace(')', r'\)').replace('>', r'\>').replace('#', r'\#').replace('+', r'\+').replace('-', r'\-').replace('.', r'\.').replace('!', r'\!').replace('$', r'\$')

    这是相同输入字符串 abc&def#ghi 的结果:

    • a)100000个循环,每个循环最好3:6.72μs
    • b)100000个循环,每个循环最好3:2.64μs
    • c)100000个循环,每个循环最好3:11.9μs
    • d)100000个循环,每个循环的最佳时间为3:4.92μs
    • e)100000个循环,每个循环最好为3:2.96μs
    • f)100000个循环,每个循环最好为3:4.29μs
    • g)100000个循环,每个循环最好为3:4.68μs
    • h)100000次循环,每循环3:4.73μs最佳
    • i)100000个循环,每个循环最好为3:4.24μs

    并使用更长的输入字符串(## *Something* and [another] thing in a longer sentence with {more} things to replace$):

    • a)100000个循环,每个循环的最佳时间为3:7.59μs
    • b)100000个循环,每个循环最好为3:6.54μs
    • c)100000个循环,每个循环最好3:16.9μs
    • d)100000个循环,每个循环最好3.:7.29μs
    • e)100000个循环,每个循环最好为3:12.2μs
    • f)100000个循环,每个循环最好为3:5.38μs
    • g)10000个循环,每个循环最好3:21.7μs
    • h)100000个循环,最佳3:每个循环5.7μs
    • i)100000个循环,每个循环中最好为3:5.13μs

    添加几个变体:

    def ab(text):
        for ch in ['\\','`','*','_','{','}','[',']','(',')','>','#','+','-','.','!','$','\'']:
            text = text.replace(ch,"\\"+ch)
    
    
    def ba(text):
        chars = "\\`*_{}[]()>#+-.!$"
        for c in chars:
            if c in text:
                text = text.replace(c, "\\" + c)

    输入较短时:

    • ab)100000个循环,每个循环中最好为3:7.05μs
    • ba)100000个循环,最佳3:每个循环2.4μs

    输入较长时:

    • ab)100000个循环,每个循环最好为3:7.71μs
    • ba)100000次循环,每循环最好3:6.08μs

    因此,出于可读性和速度考虑,我将使用 ba。

    附录

    受评论中 haccks 的提示:ab 和 ba 之间的一个区别是 if c in text: 检查。

    def ab_with_check(text):
        for ch in ['\\','`','*','_','{','}','[',']','(',')','>','#','+','-','.','!','$','\'']:
            if ch in text:
                text = text.replace(ch,"\\"+ch)
    
    def ba_without_check(text):
        chars = "\\`*_{}[]()>#+-.!$"
        for c in chars:
            text = text.replace(c, "\\" + c)

    以下是在 Python 2.7.14 和 3.6.3 上每个循环的时间(单位为 μs);测试机器与之前的不同,因此不能直接比较。

    ╭────────────╥──────┬───────────────┬──────┬──────────────────╮
    │ Py, input  ║  ab  │ ab_with_check │  ba  │ ba_without_check │
    ╞════════════╬══════╪═══════════════╪══════╪══════════════════╡
    │ Py2, short ║ 8.81 │    4.22       │ 3.45 │    8.01          │
    │ Py3, short ║ 5.54 │    1.34       │ 1.46 │    5.34          │
    ├────────────╫──────┼───────────────┼──────┼──────────────────┤
    │ Py2, long  ║ 9.3  │    7.15       │ 6.85 │    8.55          │
    │ Py3, long  ║ 7.43 │    4.38       │ 4.41 │    7.02          │
    └────────────╨──────┴───────────────┴──────┴──────────────────┘

    我们可以得出以下结论:

    • 带有检查的版本比没有检查的版本最多快 4 倍

    • ab_with_check在Python 3上稍有领先,但是ba(带检查)在Python 2上有更大的领先优势

    • 但是,这里最大的教训是Python 3比Python 2快3倍!Python 3上最慢的速度与Python 2上最快的速度之间并没有太大的区别!

    Replacing two characters

    I timed all the methods in the current answers along with one extra.

    With an input string of abc&def#ghi and replacing & -> \& and # -> \#, the fastest way was to chain together the replacements like this: text.replace('&', '\&').replace('#', '\#').

    Timings for each function:

    • a) 1000000 loops, best of 3: 1.47 μs per loop
    • b) 1000000 loops, best of 3: 1.51 μs per loop
    • c) 100000 loops, best of 3: 12.3 μs per loop
    • d) 100000 loops, best of 3: 12 μs per loop
    • e) 100000 loops, best of 3: 3.27 μs per loop
    • f) 1000000 loops, best of 3: 0.817 μs per loop
    • g) 100000 loops, best of 3: 3.64 μs per loop
    • h) 1000000 loops, best of 3: 0.927 μs per loop
    • i) 1000000 loops, best of 3: 0.814 μs per loop

    Here are the functions:

    def a(text):
        chars = "&#"
        for c in chars:
            text = text.replace(c, "\\" + c)
    
    
    def b(text):
        for ch in ['&','#']:
            if ch in text:
                text = text.replace(ch,"\\"+ch)
    
    
    import re
    def c(text):
        rx = re.compile('([&#])')
        text = rx.sub(r'\\\1', text)
    
    
    RX = re.compile('([&#])')
    def d(text):
        text = RX.sub(r'\\\1', text)
    
    
    def mk_esc(esc_chars):
        return lambda s: ''.join(['\\' + c if c in esc_chars else c for c in s])
    esc = mk_esc('&#')
    def e(text):
        esc(text)
    
    
    def f(text):
        text = text.replace('&', '\&').replace('#', '\#')
    
    
    def g(text):
        replacements = {"&": "\&", "#": "\#"}
        text = "".join([replacements.get(c, c) for c in text])
    
    
    def h(text):
        text = text.replace('&', r'\&')
        text = text.replace('#', r'\#')
    
    
    def i(text):
        text = text.replace('&', r'\&').replace('#', r'\#')
    

    Timed like this:

    python -mtimeit -s"import time_functions" "time_functions.a('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.b('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.c('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.d('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.e('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.f('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.g('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.h('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.i('abc&def#ghi')"
    

    Replacing 17 characters

    Here’s similar code to do the same but with more characters to escape (\`*_{}[]()>#+-.!$):

    def a(text):
        chars = "\\`*_{}[]()>#+-.!$"
        for c in chars:
            text = text.replace(c, "\\" + c)
    
    
    def b(text):
        for ch in ['\\','`','*','_','{','}','[',']','(',')','>','#','+','-','.','!','$','\'']:
            if ch in text:
                text = text.replace(ch,"\\"+ch)
    
    
    import re
    def c(text):
        rx = re.compile(r'([\\`*_{}\[\]()>#+\-.!$])')
        text = rx.sub(r'\\\1', text)
    
    
    RX = re.compile(r'([\\`*_{}\[\]()>#+\-.!$])')
    def d(text):
        text = RX.sub(r'\\\1', text)
    
    
    def mk_esc(esc_chars):
        return lambda s: ''.join(['\\' + c if c in esc_chars else c for c in s])
    esc = mk_esc('\\`*_{}[]()>#+-.!$')
    def e(text):
        esc(text)
    
    
    def f(text):
        text = text.replace('\\', '\\\\').replace('`', '\`').replace('*', '\*').replace('_', '\_').replace('{', '\{').replace('}', '\}').replace('[', '\[').replace(']', '\]').replace('(', '\(').replace(')', '\)').replace('>', '\>').replace('#', '\#').replace('+', '\+').replace('-', '\-').replace('.', '\.').replace('!', '\!').replace('$', '\$')
    
    
    def g(text):
        replacements = {
            "\\": "\\\\",
            "`": "\`",
            "*": "\*",
            "_": "\_",
            "{": "\{",
            "}": "\}",
            "[": "\[",
            "]": "\]",
            "(": "\(",
            ")": "\)",
            ">": "\>",
            "#": "\#",
            "+": "\+",
            "-": "\-",
            ".": "\.",
            "!": "\!",
            "$": "\$",
        }
        text = "".join([replacements.get(c, c) for c in text])
    
    
    def h(text):
        text = text.replace('\\', r'\\')
        text = text.replace('`', r'\`')
        text = text.replace('*', r'\*')
        text = text.replace('_', r'\_')
        text = text.replace('{', r'\{')
        text = text.replace('}', r'\}')
        text = text.replace('[', r'\[')
        text = text.replace(']', r'\]')
        text = text.replace('(', r'\(')
        text = text.replace(')', r'\)')
        text = text.replace('>', r'\>')
        text = text.replace('#', r'\#')
        text = text.replace('+', r'\+')
        text = text.replace('-', r'\-')
        text = text.replace('.', r'\.')
        text = text.replace('!', r'\!')
        text = text.replace('$', r'\$')
    
    
    def i(text):
        text = text.replace('\\', r'\\').replace('`', r'\`').replace('*', r'\*').replace('_', r'\_').replace('{', r'\{').replace('}', r'\}').replace('[', r'\[').replace(']', r'\]').replace('(', r'\(').replace(')', r'\)').replace('>', r'\>').replace('#', r'\#').replace('+', r'\+').replace('-', r'\-').replace('.', r'\.').replace('!', r'\!').replace('$', r'\$')
    

    Here’s the results for the same input string abc&def#ghi:

    • a) 100000 loops, best of 3: 6.72 μs per loop
    • b) 100000 loops, best of 3: 2.64 μs per loop
    • c) 100000 loops, best of 3: 11.9 μs per loop
    • d) 100000 loops, best of 3: 4.92 μs per loop
    • e) 100000 loops, best of 3: 2.96 μs per loop
    • f) 100000 loops, best of 3: 4.29 μs per loop
    • g) 100000 loops, best of 3: 4.68 μs per loop
    • h) 100000 loops, best of 3: 4.73 μs per loop
    • i) 100000 loops, best of 3: 4.24 μs per loop

    And with a longer input string (## *Something* and [another] thing in a longer sentence with {more} things to replace$):

    • a) 100000 loops, best of 3: 7.59 μs per loop
    • b) 100000 loops, best of 3: 6.54 μs per loop
    • c) 100000 loops, best of 3: 16.9 μs per loop
    • d) 100000 loops, best of 3: 7.29 μs per loop
    • e) 100000 loops, best of 3: 12.2 μs per loop
    • f) 100000 loops, best of 3: 5.38 μs per loop
    • g) 10000 loops, best of 3: 21.7 μs per loop
    • h) 100000 loops, best of 3: 5.7 μs per loop
    • i) 100000 loops, best of 3: 5.13 μs per loop

    Adding a couple of variants:

    def ab(text):
        for ch in ['\\','`','*','_','{','}','[',']','(',')','>','#','+','-','.','!','$','\'']:
            text = text.replace(ch,"\\"+ch)
    
    
    def ba(text):
        chars = "\\`*_{}[]()>#+-.!$"
        for c in chars:
            if c in text:
                text = text.replace(c, "\\" + c)
    

    With the shorter input:

    • ab) 100000 loops, best of 3: 7.05 μs per loop
    • ba) 100000 loops, best of 3: 2.4 μs per loop

    With the longer input:

    • ab) 100000 loops, best of 3: 7.71 μs per loop
    • ba) 100000 loops, best of 3: 6.08 μs per loop

    So I’m going to use ba for readability and speed.

    Addendum

    Prompted by haccks in the comments, one difference between ab and ba is the if c in text: check. Let’s test them against two more variants:

    def ab_with_check(text):
        for ch in ['\\','`','*','_','{','}','[',']','(',')','>','#','+','-','.','!','$','\'']:
            if ch in text:
                text = text.replace(ch,"\\"+ch)
    
    def ba_without_check(text):
        chars = "\\`*_{}[]()>#+-.!$"
        for c in chars:
            text = text.replace(c, "\\" + c)
    

    Times in μs per loop on Python 2.7.14 and 3.6.3, and on a different machine from the earlier set, so cannot be compared directly.

    ╭────────────╥──────┬───────────────┬──────┬──────────────────╮
    │ Py, input  ║  ab  │ ab_with_check │  ba  │ ba_without_check │
    ╞════════════╬══════╪═══════════════╪══════╪══════════════════╡
    │ Py2, short ║ 8.81 │    4.22       │ 3.45 │    8.01          │
    │ Py3, short ║ 5.54 │    1.34       │ 1.46 │    5.34          │
    ├────────────╫──────┼───────────────┼──────┼──────────────────┤
    │ Py2, long  ║ 9.3  │    7.15       │ 6.85 │    8.55          │
    │ Py3, long  ║ 7.43 │    4.38       │ 4.41 │    7.02          │
    └────────────╨──────┴───────────────┴──────┴──────────────────┘
    

    We can conclude that:

    • Those with the check are up to 4x faster than those without the check

    • ab_with_check is slightly in the lead on Python 3, but ba (with check) has a greater lead on Python 2

    • However, the biggest lesson here is Python 3 is up to 3x faster than Python 2! There’s not a huge difference between the slowest on Python 3 and fastest on Python 2!


    回答 1

    >>> string="abc&def#ghi"
    >>> for ch in ['&','#']:
    ...   if ch in string:
    ...      string=string.replace(ch,"\\"+ch)
    ...
    >>> print string
    abc\&def\#ghi
    >>> string="abc&def#ghi"
    >>> for ch in ['&','#']:
    ...   if ch in string:
    ...      string=string.replace(ch,"\\"+ch)
    ...
    >>> print string
    abc\&def\#ghi
    

    回答 2

    像这样简单地把 replace 调用链接起来

    strs = "abc&def#ghi"
    print strs.replace('&', '\&').replace('#', '\#')
    # abc\&def\#ghi

    如果要替换的内容更多,可以用下面这种通用方式

    strs, replacements = "abc&def#ghi", {"&": "\&", "#": "\#"}
    print "".join([replacements.get(c, c) for c in strs])
    # abc\&def\#ghi

    Simply chain the replace functions like this

    strs = "abc&def#ghi"
    print strs.replace('&', '\&').replace('#', '\#')
    # abc\&def\#ghi
    

    If the replacements are going to be more in number, you can do this in this generic way

    strs, replacements = "abc&def#ghi", {"&": "\&", "#": "\#"}
    print "".join([replacements.get(c, c) for c in strs])
    # abc\&def\#ghi
    

    回答 3

    这是使用 str.translate 和 str.maketrans 的 Python 3 方法:

    s = "abc&def#ghi"
    print(s.translate(str.maketrans({'&': '\&', '#': '\#'})))

    打印的字符串是abc\&def\#ghi

    Here is a python3 method using str.translate and str.maketrans:

    s = "abc&def#ghi"
    print(s.translate(str.maketrans({'&': '\&', '#': '\#'})))
    

    The printed string is abc\&def\#ghi.


    回答 4

    您是否总是要加一个反斜杠?如果是这样,请尝试

    import re
    rx = re.compile('([&#])')
    #                  ^^ fill in the characters here.
    strs = rx.sub('\\\\\\1', strs)

    它可能不是最有效的方法,但我认为这是最简单的方法。

    Are you always going to prepend a backslash? If so, try

    import re
    rx = re.compile('([&#])')
    #                  ^^ fill in the characters here.
    strs = rx.sub('\\\\\\1', strs)
    

    It may not be the most efficient method but I think it is the easiest.
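
For what it's worth, a raw string expresses the same replacement with fewer backslashes; a small sketch under the same assumption as above (escaping only & and #):

    import re
    
    rx = re.compile('([&#])')
    # r'\\\1' means: a literal backslash, then the captured character
    print(rx.sub(r'\\\1', "abc&def#ghi"))  # abc\&def\#ghi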


    回答 5

    晚会晚了,但是我在这个问题上浪费了很多时间,直到找到答案。

简短而甜美:translate优于replace。如果您更看重功能正确性而非时间上的优化,请不要使用replace。

另外,如果你不确定用来替换的字符集是否与将被替换的字符集重叠,也请使用translate。

    例子:

使用replace时,您可能会天真地期望代码片段"1234".replace("1", "2").replace("2", "3").replace("3", "4")返回"2344",但实际上它会返回"4444"。

translate的行为才是OP最初想要的。

    Late to the party, but I lost a lot of time with this issue until I found my answer.

Short and sweet, translate is superior to replace. If you’re more interested in functionality than in time optimization, do not use replace.

    Also use translate if you don’t know if the set of characters to be replaced overlaps the set of characters used to replace.

    Case in point:

Using replace you would naively expect the snippet "1234".replace("1", "2").replace("2", "3").replace("3", "4") to return "2344", but it will in fact return "4444".

    Translation seems to perform what OP originally desired.
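
To make the difference concrete, here is a minimal sketch (assuming Python 3, where str.maketrans accepts a dict) contrasting the two approaches on the overlapping case above:

    s = "1234"
    # chained replace cascades: the "2" produced by the first call is
    # rewritten by the second call, and so on
    print(s.replace("1", "2").replace("2", "3").replace("3", "4"))     # 4444
    # translate maps every character exactly once, in a single pass
    print(s.translate(str.maketrans({"1": "2", "2": "3", "3": "4"})))  # 2344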


    回答 6

    您可以考虑编写通用的转义函数:

    def mk_esc(esc_chars):
        return lambda s: ''.join(['\\' + c if c in esc_chars else c for c in s])
    
    >>> esc = mk_esc('&#')
    >>> print esc('Learn & be #1')
    Learn \& be \#1

    这样,您就可以使用应转义的字符列表来配置函数。

    You may consider writing a generic escape function:

    def mk_esc(esc_chars):
        return lambda s: ''.join(['\\' + c if c in esc_chars else c for c in s])
    
    >>> esc = mk_esc('&#')
    >>> print esc('Learn & be #1')
    Learn \& be \#1
    

This way you can make your function configurable with a list of characters that should be escaped.


    回答 7

仅供参考,这对OP几乎没有用处,甚至毫无用处,但对其他读者可能有用(请不要踩,这一点我自己清楚)。

作为一个有点荒谬但有趣的练习,我想看看能否用python函数式编程来替换多个字符。我很确定这不会胜过直接调用两次replace()。如果性能是个问题,您可以轻松地用rust、C、julia、perl、java、javascript,甚至也许是awk击败它。它使用了一个名为pytoolz的外部“帮助程序”包,该包通过cython加速(cytoolz,这是一个pypi包)。

    from functools import partial  # partial was missing from the original imports
    from cytoolz.functoolz import compose
    from cytoolz.itertoolz import chain, sliding_window
    from itertools import starmap, imap, ifilter  # Python 2 only
    from operator import itemgetter, contains
    
    text = '&hello#hi&yo&'
    # an iterator over the indices of every '&' or '#' in text
    char_index_iter = compose(partial(imap, itemgetter(0)), partial(ifilter, compose(partial(contains, '#&'), itemgetter(1))), enumerate)
    # slice text between those indices and re-join the pieces with backslashes
    print '\\'.join(imap(text.__getitem__, starmap(slice, sliding_window(2, chain((0,), char_index_iter(text), (len(text),))))))

我甚至不打算解释这段代码,因为没有人会费心用它来完成多次替换。不过,写出它让我有些成就感,也许能启发其他读者,或者用来赢得代码混淆比赛。

    FYI, this is of little or no use to the OP but it may be of use to other readers (please do not downvote, I’m aware of this).

    As a somewhat ridiculous but interesting exercise, wanted to see if I could use python functional programming to replace multiple chars. I’m pretty sure this does NOT beat just calling replace() twice. And if performance was an issue, you could easily beat this in rust, C, julia, perl, java, javascript and maybe even awk. It uses an external ‘helpers’ package called pytoolz, accelerated via cython (cytoolz, it’s a pypi package).

    from functools import partial  # partial was missing from the original imports
    from cytoolz.functoolz import compose
    from cytoolz.itertoolz import chain, sliding_window
    from itertools import starmap, imap, ifilter  # Python 2 only
    from operator import itemgetter, contains
    
    text = '&hello#hi&yo&'
    # an iterator over the indices of every '&' or '#' in text
    char_index_iter = compose(partial(imap, itemgetter(0)), partial(ifilter, compose(partial(contains, '#&'), itemgetter(1))), enumerate)
    # slice text between those indices and re-join the pieces with backslashes
    print '\\'.join(imap(text.__getitem__, starmap(slice, sliding_window(2, chain((0,), char_index_iter(text), (len(text),))))))
    

    I’m not even going to explain this because no one would bother using this to accomplish multiple replace. Nevertheless, I felt somewhat accomplished in doing this and thought it might inspire other readers or win a code obfuscation contest.


    回答 8

使用在python2.7和python3.*中都可用的reduce,您可以轻松地以干净、pythonic的方式替换多个子字符串。

    # Let's define a helper method to make it easy to use
    from functools import reduce  # built in on python2.7; this import is required on python3.* and also works on python2.6+
    
    def replacer(text, replacements):
        return reduce(
            lambda text, ptuple: text.replace(ptuple[0], ptuple[1]),
            replacements, text
        )
    
    if __name__ == '__main__':
        uncleaned_str = "abc&def#ghi"
        cleaned_str = replacer(uncleaned_str, [("&","\&"),("#","\#")])
        print(cleaned_str) # "abc\&def\#ghi"

在python2.7中,您不必导入reduce,但在python3.*中,您必须从functools模块中导入它(上面的导入语句在两个版本中都有效)。

    Using reduce which is available in python2.7 and python3.* you can easily replace mutiple substrings in a clean and pythonic way.

    # Let's define a helper method to make it easy to use
    from functools import reduce  # built in on python2.7; this import is required on python3.* and also works on python2.6+
    
    def replacer(text, replacements):
        return reduce(
            lambda text, ptuple: text.replace(ptuple[0], ptuple[1]),
            replacements, text
        )
    
    if __name__ == '__main__':
        uncleaned_str = "abc&def#ghi"
        cleaned_str = replacer(uncleaned_str, [("&","\&"),("#","\#")])
        print(cleaned_str) # "abc\&def\#ghi"
    

In python2.7 you don’t have to import reduce but in python3.* you have to import it from the functools module (the import shown above works on both).


    回答 9

也许可以用一个简单的循环来替换字符:

    a = '&#'
    
    to_replace = ['&', '#']
    
    for char in to_replace:
        a = a.replace(char, "\\"+char)
    
    print(a)
    
    >>> \&\#

    Maybe a simple loop for chars to replace:

    a = '&#'
    
    to_replace = ['&', '#']
    
    for char in to_replace:
        a = a.replace(char, "\\"+char)
    
    print(a)
    
    >>> \&\#
    

    回答 10

    这个怎么样?

    def replace_all(replacements, text):  # renamed params to avoid shadowing the built-ins dict and str
        for key in replacements:
            text = text.replace(key, replacements[key])
        return text

    然后

    print(replace_all({"&":"\&", "#":"\#"}, "&#"))

    输出

    \&\#

类似于前面的一个答案

    How about this?

    def replace_all(replacements, text):  # renamed params to avoid shadowing the built-ins dict and str
        for key in replacements:
            text = text.replace(key, replacements[key])
        return text
    

    then

    print(replace_all({"&":"\&", "#":"\#"}, "&#"))
    

    output

    \&\#
    

similar to an earlier answer


    回答 11

    >>> a = '&#'
    >>> print a.replace('&', r'\&')
    \&#
    >>> print a.replace('#', r'\#')
    &\#
    >>> 

您需要使用“原始”字符串(由替换字符串前面的前缀r表示),因为原始字符串不会对反斜杠做特殊处理。

    >>> a = '&#'
    >>> print a.replace('&', r'\&')
    \&#
    >>> print a.replace('#', r'\#')
    &\#
    >>> 
    

You want to use a ‘raw’ string (denoted by the ‘r’ prefixing the replacement string), since raw strings do not treat the backslash specially.


    替换Python NumPy数组中所有大于某个值的元素

    问题:替换Python NumPy数组中所有大于某个值的元素

    我有一个2D NumPy数组,并希望将大于或等于阈值T的所有值替换为255.0。据我所知,最基本的方法是:

    shape = arr.shape
    result = np.zeros(shape)
    for x in range(0, shape[0]):
        for y in range(0, shape[1]):
            if arr[x, y] >= T:
                result[x, y] = 255
    
    1. 什么是最简洁,最pythonic的方法?

    2. 有更快的方法(可能不太简洁和/或更少的pythonic)来做到这一点吗?

这将是用于人类头部MRI扫描的窗宽/窗位调整子程序的一部分。这个2D numpy数组就是图像的像素数据。

    I have a 2D NumPy array and would like to replace all values in it greater than or equal to a threshold T with 255.0. To my knowledge, the most fundamental way would be:

    shape = arr.shape
    result = np.zeros(shape)
    for x in range(0, shape[0]):
        for y in range(0, shape[1]):
            if arr[x, y] >= T:
                result[x, y] = 255
    
    1. What is the most concise and pythonic way to do this?

    2. Is there a faster (possibly less concise and/or less pythonic) way to do this?

    This will be part of a window/level adjustment subroutine for MRI scans of the human head. The 2D numpy array is the image pixel data.


    回答 0

我认为最快且最简洁的方法是使用NumPy内置的花式索引(fancy indexing)。如果您有一个名为arr的ndarray,可以按如下方式把所有大于255的元素替换为某个值x:

    arr[arr > 255] = x

我在自己的机器上用一个500 x 500的随机矩阵运行了此命令,把所有大于0.5的值替换为5,平均耗时7.59ms。

    In [1]: import numpy as np
    In [2]: A = np.random.rand(500, 500)
    In [3]: timeit A[A > 0.5] = 5
    100 loops, best of 3: 7.59 ms per loop
    

    I think both the fastest and most concise way to do this is to use NumPy’s built-in Fancy indexing. If you have an ndarray named arr, you can replace all elements >255 with a value x as follows:

    arr[arr > 255] = x
    

    I ran this on my machine with a 500 x 500 random matrix, replacing all values >0.5 with 5, and it took an average of 7.59ms.

    In [1]: import numpy as np
    In [2]: A = np.random.rand(500, 500)
    In [3]: timeit A[A > 0.5] = 5
    100 loops, best of 3: 7.59 ms per loop
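
If, as in the question, the original array should stay untouched, the same fancy indexing can be applied to a copy; a minimal sketch (the threshold T and the array here are illustrative):

    import numpy as np
    
    T = 200.0
    arr = 300 * np.random.rand(4, 4)   # stand-in for the image pixel data
    result = arr.copy()                # leave arr itself unmodified
    result[arr >= T] = 255.0           # boolean mask selects every element >= T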
    

    回答 1

由于您实际想要的是另一个数组:在arr < 255处取arr,否则取255,这可以简单地这样完成:

    result = np.minimum(arr, 255)

    更一般而言,对于下限和/或上限:

    result = np.clip(arr, 0, 255)

如果您只是想访问超过255的值,或者有更复杂的需求,@mtitan8的回答更为通用;但对于您的情况,np.clip和np.minimum(或np.maximum)更好也更快:

    In [292]: timeit np.minimum(a, 255)
    100000 loops, best of 3: 19.6 µs per loop
    
    In [293]: %%timeit
       .....: c = np.copy(a)
       .....: c[a>255] = 255
       .....: 
    10000 loops, best of 3: 86.6 µs per loop
    

如果您想就地进行操作(即修改arr而不是创建result),则可以使用np.minimum的out参数:

    np.minimum(arr, 255, out=arr)

    要么

    np.clip(arr, 0, 255, arr)

(out=这个名称是可选的,因为各参数与函数定义中的顺序一致。)

    对于就地修改,布尔索引可以提高很多速度(无需分别制作然后修改副本),但是仍然不如minimum

    In [328]: %%timeit
       .....: a = np.random.randint(0, 300, (100,100))
       .....: np.minimum(a, 255, a)
       .....: 
    100000 loops, best of 3: 303 µs per loop
    
    In [329]: %%timeit
       .....: a = np.random.randint(0, 300, (100,100))
       .....: a[a>255] = 255
       .....: 
    100000 loops, best of 3: 356 µs per loop
    

作为比较,如果您想同时用最小值和最大值来限制取值,在不使用clip的情况下就必须执行两次,例如

    np.minimum(a, 255, a)
    np.maximum(a, 0, a)
    

    要么,

    a[a>255] = 255
    a[a<0] = 0
    

    Since you actually want a different array which is arr where arr < 255, and 255 otherwise, this can be done simply:

    result = np.minimum(arr, 255)
    

    More generally, for a lower and/or upper bound:

    result = np.clip(arr, 0, 255)
    

    If you just want to access the values over 255, or something more complicated, @mtitan8’s answer is more general, but np.clip and np.minimum (or np.maximum) are nicer and much faster for your case:

    In [292]: timeit np.minimum(a, 255)
    100000 loops, best of 3: 19.6 µs per loop
    
    In [293]: %%timeit
       .....: c = np.copy(a)
       .....: c[a>255] = 255
       .....: 
    10000 loops, best of 3: 86.6 µs per loop
    

    If you want to do it in-place (i.e., modify arr instead of creating result) you can use the out parameter of np.minimum:

    np.minimum(arr, 255, out=arr)
    

    or

    np.clip(arr, 0, 255, arr)
    

(the out= name is optional since the arguments are in the same order as in the function’s definition.)

    For in-place modification, the boolean indexing speeds up a lot (without having to make and then modify the copy separately), but is still not as fast as minimum:

    In [328]: %%timeit
       .....: a = np.random.randint(0, 300, (100,100))
       .....: np.minimum(a, 255, a)
       .....: 
    100000 loops, best of 3: 303 µs per loop
    
    In [329]: %%timeit
       .....: a = np.random.randint(0, 300, (100,100))
       .....: a[a>255] = 255
       .....: 
    100000 loops, best of 3: 356 µs per loop
    

    For comparison, if you wanted to restrict your values with a minimum as well as a maximum, without clip you would have to do this twice, with something like

    np.minimum(a, 255, a)
    np.maximum(a, 0, a)
    

    or,

    a[a>255] = 255
    a[a<0] = 0
    

    回答 2

我认为使用where函数可以最快地实现此目的:

    例如,在numpy数组中查找大于0.2的项并将其替换为0:

    import numpy as np
    
    nums = np.random.rand(4,3)
    
    print np.where(nums > 0.2, 0, nums)
    

    I think you can achieve this the quickest by using the where function:

    For example looking for items greater than 0.2 in a numpy array and replacing those with 0:

    import numpy as np
    
    nums = np.random.rand(4,3)
    
    print np.where(nums > 0.2, 0, nums)
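    
    # note that np.where returns a new array and leaves nums unchanged,
    # so (as with the string methods above) the result must be assigned:
    nums = np.where(nums > 0.2, 0, nums)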
    

    回答 3

    您可以考虑使用numpy.putmask

    np.putmask(arr, arr>=T, 255.0)

    这是与Numpy内置索引的性能比较:

    In [1]: import numpy as np
    In [2]: A = np.random.rand(500, 500)
    
    In [3]: timeit np.putmask(A, A>0.5, 5)
    1000 loops, best of 3: 1.34 ms per loop
    
    In [4]: timeit A[A > 0.5] = 5
    1000 loops, best of 3: 1.82 ms per loop

    You can consider using numpy.putmask:

    np.putmask(arr, arr>=T, 255.0)
    

    Here is a performance comparison with the Numpy’s builtin indexing:

    In [1]: import numpy as np
    In [2]: A = np.random.rand(500, 500)
    
    In [3]: timeit np.putmask(A, A>0.5, 5)
    1000 loops, best of 3: 1.34 ms per loop
    
    In [4]: timeit A[A > 0.5] = 5
    1000 loops, best of 3: 1.82 ms per loop
    

    回答 4

另一种方法是使用np.place,它进行就地替换,并且适用于多维数组:

    import numpy as np
    
    # create 2x3 array with numbers 0..5
    arr = np.arange(6).reshape(2, 3)
    
    # replace 0 with -10
    np.place(arr, arr == 0, -10)

    Another way is to use np.place which does in-place replacement and works with multidimentional arrays:

    import numpy as np
    
    # create 2x3 array with numbers 0..5
    arr = np.arange(6).reshape(2, 3)
    
    # replace 0 with -10
    np.place(arr, arr == 0, -10)
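    
    # printing confirms the in-place replacement:
    print(arr)
    # [[-10   1   2]
    #  [  3   4   5]]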
    

    回答 5

你还可以使用&和|(与/或)来获得更多灵活性:

    介于5到10之间的值: A[(A>5)&(A<10)]

    大于10或小于5的值: A[(A<5)|(A>10)]

    You can also use &, | (and/or) for more flexibility:

    values between 5 and 10: A[(A>5)&(A<10)]

    values greater than 10 or smaller than 5: A[(A<5)|(A>10)]
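
A minimal sketch (the array A here is illustrative, not from the question) showing both selection and in-place assignment with combined masks:

    import numpy as np
    
    A = np.array([2, 6, 9, 12, 4])
    print(A[(A > 5) & (A < 10)])   # [6 9]      values between 5 and 10
    print(A[(A < 5) | (A > 10)])   # [ 2 12  4] values outside that range
    A[(A < 5) | (A > 10)] = 0      # the same masks work for in-place assignment
    print(A)                       # [0 6 9 0 0]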