How do I get the Python interpreter to correctly handle non-ASCII characters in string operations?


I have a string that looks like so:

6 918 417 712

The clear cut way to trim this string (as I understand Python) is simply to say the string is in a variable called s, we get:

s.replace('Â ', '')

That should do the trick. But of course it complains that the non-ASCII character '\xc2' in file blabla.py is not encoded.

I never quite could understand how to switch between different encodings.

Here’s the code, it really is just the same as above, but now it’s in context. The file is saved as UTF-8 in notepad and has the following header:

#!/usr/bin/python2.4
# -*- coding: utf-8 -*-

The code:

f = urllib.urlopen(url)

soup = BeautifulSoup(f)

s = soup.find('div', {'id':'main_count'})

# printing 's' here goes well; it shows 6Â 918Â 417Â 712

s.replace('Â ','')

save_main_count(s)

It gets no further than s.replace


Answer 0


Python 2 uses ascii as the default encoding for source files, which means you must specify another encoding at the top of the file to use non-ascii unicode characters in literals. Python 3 uses utf-8 as the default encoding for source files, so this is less of an issue.

See: http://docs.python.org/tutorial/interpreter.html#source-code-encoding

To enable utf-8 source encoding, this would go in one of the top two lines:

# -*- coding: utf-8 -*-

The above is in the docs, but this also works:

# coding: utf-8

Additional considerations:

  • The source file must be saved using the correct encoding in your text editor as well.

  • In Python 2, the unicode literal must have a u before it, as in s.replace(u"Â ", u""). But in Python 3, just use quotes. In Python 2, you can from __future__ import unicode_literals to obtain the Python 3 behavior, but be aware this affects the entire current module.

  • s.replace(u"Â ", u"") will also fail if s is not a unicode string.

  • string.replace returns a new string and does not edit in place, so make sure you’re using the return value as well
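A minimal Python 3 sketch of the last two points: decode first so s is a text string, then keep replace's return value (the raw byte string below is an assumption standing in for the scraped value):

```python
# Hypothetical raw value standing in for the scraped bytes (an assumption).
raw = b'6\xc2\xa0918\xc2\xa0417\xc2\xa0712'

s = raw.decode('utf-8')          # now a text (unicode) string
cleaned = s.replace('\xa0', '')  # replace returns a new string; keep it
print(cleaned)                   # 6918417712
```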


Answer 1


def removeNonAscii(s): return "".join(filter(lambda x: ord(x)<128, s))

edit: my first impulse is always to use a filter, but the generator expression is more memory efficient (and shorter)…

def removeNonAscii(s): return "".join(i for i in s if ord(i)<128)

Keep in mind that this is guaranteed to work with UTF-8 encoding (because all bytes in multi-byte characters have the highest bit set to 1).


Answer 2

>>> unicode_string = u"hello aåbäcö"
>>> unicode_string.encode("ascii", "ignore")
'hello abc'
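In Python 3, encode() returns bytes, so a decode back to str is needed to end up with text again; a small sketch:

```python
unicode_string = "hello aåbäcö"
# encode() yields bytes in Python 3; decode back to get a str
ascii_only = unicode_string.encode("ascii", "ignore").decode("ascii")
print(ascii_only)  # hello abc
```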

Answer 3


The following code will replace all non ASCII characters with question marks.

"".join([x if ord(x) < 128 else '?' for x in s])

Answer 4


Using Regex:

import re

strip_unicode = re.compile("([^-_a-zA-Z0-9!@#%&=,/'\";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\\]+|[^\s]+)")
print strip_unicode.sub('', u'6Â 918Â 417Â 712')

Answer 5


Way too late for an answer, but the original string was in UTF-8, and '\xc2\xa0' is the UTF-8 encoding of NO-BREAK SPACE. Simply decode the original string with s.decode('utf-8'). (\xa0 displays as a space when incorrectly decoded as Windows-1252 or latin-1.)

Example (Python 3)

s = b'6\xc2\xa0918\xc2\xa0417\xc2\xa0712'
print(s.decode('latin-1')) # incorrectly decoded
u = s.decode('utf8') # correctly decoded
print(u)
print(u.replace('\N{NO-BREAK SPACE}','_'))
print(u.replace('\xa0','-')) # \xa0 is Unicode for NO-BREAK SPACE

Output

6 918 417 712
6 918 417 712
6_918_417_712
6-918-417-712

Answer 6


#!/usr/bin/env python
# -*- coding: utf-8 -*-

s = u"6Â 918Â 417Â 712"
s = s.replace(u"Â", "") 
print s

This will print out 6 918 417 712


Answer 7


I know it’s an old thread, but I felt compelled to mention the translate method, which is always a good way to replace all character codes above 128 (or other if necessary).

Usage: str.translate(table[, deletechars])

>>> trans_table = ''.join( [chr(i) for i in range(128)] + [' '] * 128 )

>>> 'Résultat'.translate(trans_table)
'R sultat'
>>> '6Â 918Â 417Â 712'.translate(trans_table)
'6  918  417  712'

Starting with Python 2.6, you can also set the table to None, and use deletechars to delete the characters you don’t want as in the examples shown in the standard docs at http://docs.python.org/library/stdtypes.html.

With unicode strings, the translation table is not a 256-character string but a dict with the ord() of relevant characters as keys. But anyway, getting a proper ascii string from a unicode string is simple enough, using the method mentioned by truppo above, namely: unicode_string.encode("ascii", "ignore")

As a summary, if for some reason you absolutely need to get an ascii string (for instance, when you raise a standard exception with raise Exception, ascii_message ), you can use the following function:

trans_table = ''.join( [chr(i) for i in range(128)] + ['?'] * 128 )
def ascii(s):
    if isinstance(s, unicode):
        return s.encode('ascii', 'replace')
    else:
        return s.translate(trans_table)

The good thing with translate is that you can actually convert accented characters to relevant non-accented ascii characters instead of simply deleting them or replacing them by ‘?’. This is often useful, for instance for indexing purposes.
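In Python 3, str.translate only accepts the dict form, so a rough equivalent of the 256-character table above might look like this (the to_ascii name is made up for illustration):

```python
def to_ascii(s):
    # map every code point >= 128 that occurs in s to '?'
    return s.translate({ord(ch): '?' for ch in s if ord(ch) >= 128})

print(to_ascii('Résultat'))  # R?sultat
```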


Answer 8


s.replace(u'Â ', '')              # u before string is important

and make your .py file unicode.


Answer 9


This is a dirty hack, but may work.

s2 = ""
for i in s:
    if ord(i) < 128:
        s2 += i
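Repeated += concatenation copies the string on every iteration; a join over a generator does the same filtering in one pass (a sketch, with a sample value assumed for s):

```python
s = '6\xa0918\xa0417\xa0712'  # assumed sample input
# keep only characters below code point 128, joined in a single pass
s2 = "".join(ch for ch in s if ord(ch) < 128)
print(s2)  # 6918417712
```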

Answer 10


For what it was worth, my character set was utf-8 and I had included the classic "# -*- coding: utf-8 -*-" line.

However, I discovered that I didn’t have Universal Newlines when reading this data from a webpage.

My text had two words, separated by "\r\n". I was splitting only on the \n and replacing the "\n".

Once I looped through and saw the character set in question, I realized the mistake.

So, it could also be within the ASCII character set, but a character that you didn’t expect.
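A sketch of that pitfall: splitting on '\n' alone leaves a stray '\r' attached, while splitlines() treats \r\n as one separator (the sample text is an assumption):

```python
text = "first\r\nsecond"    # assumed sample with Windows line endings
print(text.split('\n'))     # ['first\r', 'second'] -- stray \r survives
print(text.splitlines())    # ['first', 'second']
```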