Replace non-ASCII characters with a single space

Question: replace non-ASCII characters with a single space

I need to replace all non-ASCII characters (everything outside \x00-\x7F) with a space. I’m surprised that this is not dead easy in Python, unless I’m missing something. The following function simply removes all non-ASCII characters:

def remove_non_ascii_1(text):
    return ''.join(i for i in text if ord(i) < 128)

And this one replaces each non-ASCII character with as many spaces as there are bytes in the character’s encoding (i.e. a character may be replaced with 3 spaces):

import re

def remove_non_ascii_2(text):
    return re.sub(r'[^\x00-\x7F]', ' ', text)

How can I replace all non-ASCII characters with a single space?

Of the myriad of similar SO questions, none address replacing the characters as opposed to stripping them, and none deal with all non-ASCII characters rather than one specific character.


Answer 0

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+',' ', text)

Note the + there.
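
A quick comparison of the two behaviours (a small sketch; the sample string is the one used in a later answer, and the expected output is shown in the comments):

import re

text = 'ABC马克def'

# One space per non-ASCII character:
print(''.join([c if ord(c) < 128 else ' ' for c in text]))  # 'ABC  def'

# One space per run of consecutive non-ASCII characters (note the +):
print(re.sub(r'[^\x00-\x7F]+', ' ', text))                  # 'ABC def'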


Answer 1

To get the closest ASCII representation of your original string, I recommend the unidecode module:

from unidecode import unidecode

def remove_non_ascii(text):
    # In Python 3, text is already a Unicode str, so it can be passed to
    # unidecode() directly (the original used Python 2's unicode(text, encoding="utf-8")).
    return unidecode(text)

Then you can use it on a string:

>>> remove_non_ascii("Ceñía")
'Cenia'
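
Note that unidecode transliterates rather than blanking characters, so a single non-ASCII character can map to several ASCII characters or to none at all. If you want transliteration where possible and a single space otherwise, a minimal sketch (a similar unidecode-based test appears in a later answer; assumes the unidecode package is installed):

from unidecode import unidecode

def ascii_or_space(text):
    # Transliterate each character; fall back to a single space when
    # unidecode has no ASCII mapping for it.
    return ''.join(unidecode(c) or ' ' for c in text)

print(ascii_or_space("Ceñía"))  # 'Cenia'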

Answer 2

For character processing, use Unicode strings:

PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
>>> s='ABC马克def'
>>> import re
>>> re.sub(r'[^\x00-\x7f]',r' ',s)   # Each char is a Unicode codepoint.
'ABC  def'
>>> b = s.encode('utf8')
>>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
b'ABC      def'

But note you will still have a problem if your string contains decomposed Unicode characters (separate character and combining accent marks, for example):

>>> s = 'mañana'
>>> len(s)
6
>>> import unicodedata as ud
>>> n=ud.normalize('NFD',s)
>>> n
'mañana'
>>> len(n)
7
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
'ma ana'
>>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
'man ana'
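
If decomposed input is a concern, one option (a sketch, not part of the original answer) is to normalize to NFC first, so a base letter and its combining mark are recomposed into a single codepoint before the substitution:

import re
import unicodedata as ud

def replace_non_ascii(text):
    # Recompose decomposed sequences (e.g. 'n' + combining tilde -> 'ñ')
    # so each accented letter is a single codepoint before matching.
    composed = ud.normalize('NFC', text)
    return re.sub(r'[^\x00-\x7F]+', ' ', composed)

print(replace_non_ascii(ud.normalize('NFD', 'mañana')))  # 'ma ana'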

Answer 3

If the replacement character can be ‘?’ instead of a space, then I’d suggest result = text.encode('ascii', 'replace').decode():

"""Test the performance of different non-ASCII replacement methods."""


import re
from timeit import timeit


# 10_000 is typical in the project that I'm working on and most of the text
# is going to be non-ASCII.
text = 'Æ' * 10_000


print(timeit(
    """
result = ''.join([c if ord(c) < 128 else '?' for c in text])
    """,
    number=1000,
    globals=globals(),
))

print(timeit(
    """
result = text.encode('ascii', 'replace').decode()
    """,
    number=1000,
    globals=globals(),
))

Results:

0.7208260721400134
0.009975979187503592
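
If you need a space rather than '?' but want to keep the fast str.encode() path, one possibility (a sketch; the handler name 'spacereplace' is arbitrary, and I have not benchmarked it against the numbers above) is to register a custom codec error handler:

import codecs

def _space_for_unencodable(exc):
    # Replace the unencodable span with one space per character and
    # resume encoding right after it.
    return (' ' * (exc.end - exc.start), exc.end)

codecs.register_error('spacereplace', _space_for_unencodable)

text = 'ABC马克def'
print(text.encode('ascii', 'spacereplace').decode())  # 'ABC  def'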

Answer 4

What about this one?

def replace_trash(unicode_string):
    chars = []
    for char in unicode_string:
        try:
            char.encode("ascii")
            chars.append(char)
        except UnicodeEncodeError:
            # Non-ASCII: replace it with a single space.
            chars.append(" ")
    return "".join(chars)
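
With the corrected version above, for example:

>>> replace_trash('ABC马克def')
'ABC  def'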

Answer 5

As a built-in and efficient approach, you don’t need to use ord() or loop over the characters at all; just encode to ASCII and ignore the errors.

The following will just remove the non-ASCII characters:

new_string = old_string.encode('ascii',errors='ignore')

Now, if you want to pad the result back to the original length, do the following (note that this appends the spaces at the end rather than putting them where the removed characters were):

final_string = new_string + b' ' * (len(old_string) - len(new_string))
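
A quick check of what this produces (note that the result is bytes, so you may want a final .decode()):

old_string = 'ABC马克def'
new_string = old_string.encode('ascii', errors='ignore')
final_string = new_string + b' ' * (len(old_string) - len(new_string))

print(new_string)    # b'ABCdef'
print(final_string)  # b'ABCdef  '  (spaces at the end, not in place)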

Answer 6

Potentially for a different question, but I’m providing my version of @Alvero’s answer (using unidecode). I want to do a “regular” strip on my strings, i.e. strip whitespace characters from the beginning and end of the string, and then replace only the other whitespace characters with a “regular” space, i.e. turn

"Ceñíaㅤmañanaㅤㅤㅤㅤ"

to

"Ceñía mañana"

The following does that:

from unidecode import unidecode

def safely_stripped(s: str):
    return ' '.join(
        stripped for stripped in
        (bit.strip() for bit in
         ''.join((c if unidecode(c) else ' ') for c in s).strip().split())
        if stripped)

We first replace every character that unidecode cannot map to ASCII (such as the non-standard space) with a regular space (and join the result back into a string),

''.join((c if unidecode(c) else ' ') for c in s)

And then we split that result again, with Python’s normal split, and strip each “bit”,

(bit.strip() for bit in s.split())

And lastly join those back again, but only if the string passes an if test,

' '.join(stripped for stripped in s if stripped)

And with that, safely_stripped('ㅤㅤㅤㅤCeñíaㅤmañanaㅤㅤㅤㅤ') correctly returns 'Ceñía mañana'.
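
A quick check of the claimed behaviour, using the safely_stripped function defined above (assumes the unidecode package is installed; the filler below is the same blank-looking non-ASCII character used in the example):

filler = 'ㅤ'  # the non-standard space character from the example
sample = filler * 4 + 'Ceñía' + filler + 'mañana' + filler * 4
print(safely_stripped(sample))  # per the answer: Ceñía mañana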