How do I make the Python interpreter correctly handle non-ASCII characters in string operations?

Question 1

I have a string that looks like so:

6Â 918Â 417Â 712

The clear-cut way to trim this string (as I understand Python) is simple. Assuming the string is in a variable called s, we get:

s.replace('Â ', '')

That should do the trick. But of course Python complains that the file blabla.py contains the non-ASCII character '\xc2' and no encoding is declared.

I never quite could understand how to switch between different encodings.

Here's the code. It really is just the same as above, but now in context. The file is saved as UTF-8 in Notepad and has the following header:

#!/usr/bin/python2.4
# -*- coding: utf-8 -*-

The code:

f = urllib.urlopen(url)

soup = BeautifulSoup(f)

s = soup.find('div', {'id':'main_count'})

#making a print 's' here goes well. it shows 6Â 918Â 417Â 712

s.replace('Â ','')

save_main_count(s)

It gets no further than s.replace…

Question 2

Python 2 uses ASCII as the default encoding for source files, which means you must declare another encoding at the top of the file to use non-ASCII characters in literals. Python 3 uses UTF-8 as the default encoding for source files, so this is less of an issue.

See: http://docs.python.org/tutorial/interpreter.html#source-code-encoding

To enable utf-8 source encoding, this would go in one of the top two lines:

# -*- coding: utf-8 -*-

The above is in the docs, but this also works:

# coding: utf-8

Additional considerations:

The source file must also be saved in that encoding by your text editor.
In Python 2, a unicode literal needs a u prefix, as in s.replace(u"Â ", u""). In Python 3, plain quotes suffice. In Python 2 you can use from __future__ import unicode_literals to get the Python 3 behavior, but be aware that this affects every string literal in the current module.
s.replace(u"Â ", u"") will also fail if s is not a unicode string.
str.replace returns a new string and does not edit in place, so make sure you use the return value as well.
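
The last point is easy to miss, so here is a minimal Python 3 sketch of it. The sample value is an assumption: it stands in for the question's string, with the digits separated by NO-BREAK SPACE (U+00A0).

```python
# Python 3: str is already Unicode, so no u prefix is needed.
# Assumed sample: digits separated by NO-BREAK SPACE (U+00A0).
s = "6\u00a0918\u00a0417\u00a0712"

s.replace("\u00a0", "")            # return value discarded, s is unchanged
unchanged = s

cleaned = s.replace("\u00a0", "")  # keep the string that replace() returns
print(cleaned)
```

Rebinding the name (s = s.replace(...)) is the usual idiom when the original value is no longer needed.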

Question 3

def removeNonAscii(s): return "".join(filter(lambda x: ord(x)<128, s))

edit: my first impulse is always to use a filter, but the generator expression is more memory efficient (and shorter)…

def removeNonAscii(s): return "".join(i for i in s if ord(i)<128)

Keep in mind that this is guaranteed to work with UTF-8 encoding (because all bytes in multi-byte characters have the highest bit set to 1).
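
The functions above are written for Python 2 byte strings. A rough Python 3 sketch of the same idea on bytes might look like this; the function name and the sample data are made up for illustration.

```python
def remove_non_ascii_bytes(data: bytes) -> bytes:
    """Drop every byte whose high bit is set (value >= 128)."""
    # Iterating over bytes in Python 3 yields integers, so ord() is not needed.
    return bytes(b for b in data if b < 128)


# Assumed sample: NO-BREAK SPACE and "é" each encode to two high-bit bytes in UTF-8.
raw = "6\u00a0918 caf\u00e9".encode("utf-8")
result = remove_non_ascii_bytes(raw)
print(result)
```

Because every byte of a multi-byte UTF-8 sequence is 128 or above, whole characters are removed and no partial sequences are left behind.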

Question 4

>>> unicode_string = u"hello aåbäcö"
>>> unicode_string.encode("ascii", "ignore")
'hello abc'
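
The session above is Python 2. In Python 3 the same call works on a plain str and returns bytes; a small sketch, which also shows the "replace" error handler for comparison:

```python
text = "hello a\u00e5b\u00e4c\u00f6"          # "hello aåbäcö"

ignored = text.encode("ascii", "ignore")      # unencodable characters are dropped
replaced = text.encode("ascii", "replace")    # unencodable characters become "?"
as_text = ignored.decode("ascii")             # back to str if text is needed

print(ignored, replaced, as_text)
```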

Question 5

The following code will replace all non-ASCII characters with question marks.

"".join([x if ord(x) < 128 else '?' for x in s])

Question 6

Using Regex:

import re

# Match any run of characters outside the ASCII range (code points above 127).
strip_unicode = re.compile(r'[^\x00-\x7F]+')
print(strip_unicode.sub(u'', u'6Â 918Â 417Â 712'))

Question 7

Way too late for an answer, but the original string was in UTF-8, and '\xc2\xa0' is the UTF-8 encoding of NO-BREAK SPACE. Simply decode the original string with s.decode('utf-8'). (When the bytes are decoded incorrectly as Windows-1252 or latin-1, \xc2 displays as Â and \xa0 displays as a space.)

Example (Python 3)

s = b'6\xc2\xa0918\xc2\xa0417\xc2\xa0712'
print(s.decode('latin-1')) # incorrectly decoded
u = s.decode('utf8') # correctly decoded
print(u)
print(u.replace('\N{NO-BREAK SPACE}','_'))
print(u.replace('\xa0','-')) # \xa0 is Unicode for NO-BREAK SPACE

Output

6Â 918Â 417Â 712
6 918 417 712
6_918_417_712
6-918-417-712
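
If only the already-garbled text is available rather than the original bytes, the wrong decoding can often be reversed. This is a sketch under the assumption that the text was mis-decoded as latin-1; the variable names are illustrative.

```python
# Assumed input: UTF-8 bytes that were wrongly decoded as latin-1,
# so each NO-BREAK SPACE shows up as "Â" followed by U+00A0.
garbled = "6\u00c2\u00a0918\u00c2\u00a0417\u00c2\u00a0712"

# Re-encode with the wrong codec to recover the original bytes,
# then decode them with the right one.
repaired = garbled.encode("latin-1").decode("utf-8")

digits = repaired.replace("\N{NO-BREAK SPACE}", "")
print(digits)
```

This round trip only works when the mistaken codec maps every byte to a character, as latin-1 does; text that passed through a lossy conversion cannot be recovered this way.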

Question 8

#!/usr/bin/env python
# -*- coding: utf-8 -*-

s = u"6Â 918Â 417Â 712"
s = s.replace(u"Â", "") 
print s

This prints 6 918 417 712.

Question 9

I know it's an old thread, but I felt compelled to mention the translate method, which is always a good way to replace all character codes of 128 and above (or any other set, if necessary).

Usage: str.translate(table[, deletechars])

>>> trans_table = ''.join( [chr(i) for i in range(128)] + [' '] * 128 )

>>> 'Résultat'.translate(trans_table)
'R sultat'
>>> '6Â 918Â 417Â 712'.translate(trans_table)
'6  918  417  712'

Starting with Python 2.6, you can also set the table to None, and use deletechars to delete the characters you don’t want as in the examples shown in the standard docs at http://docs.python.org/library/stdtypes.html.

With unicode strings, the translation table is not a 256-character string but a dict with the ord() of the relevant characters as keys. In any case, getting a proper ASCII string from a unicode string is simple enough using the method mentioned by truppo above, namely: unicode_string.encode("ascii", "ignore")
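
As an illustration of the dict-based table, here is a minimal Python 3 sketch, where every str is a unicode string. Mapping a code point to None deletes that character; the sample string is an assumption.

```python
# Assumed sample containing NO-BREAK SPACE (U+00A0) and "é" (U+00E9).
s = "6\u00a0918 R\u00e9sultat"

# Keys are code points (the ord() of each character); a value of None deletes it.
delete_table = {ord(ch): None for ch in set(s) if ord(ch) > 127}

deleted = s.translate(delete_table)
print(deleted)
```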

In summary, if for some reason you absolutely need an ASCII string (for instance, when you raise a standard exception with raise Exception, ascii_message), you can use the following function:

trans_table = ''.join( [chr(i) for i in range(128)] + ['?'] * 128 )
def ascii(s):
    if isinstance(s, unicode):
        return s.encode('ascii', 'replace')
    else:
        return s.translate(trans_table)

The good thing with translate is that you can actually convert accented characters to relevant non-accented ascii characters instead of simply deleting them or replacing them by ‘?’. This is often useful, for instance for indexing purposes.
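
A rough Python 3 sketch of that idea uses str.maketrans to build the table. The character list below is an illustrative, deliberately incomplete mapping for a few French letters, not a general transliteration scheme.

```python
# Illustrative mapping only: both strings must have the same length,
# and each character in the first maps to the one at the same index in the second.
accented = "àâäéèêëîïôöùûüç"
unaccented = "aaaeeeeiioouuuc"
table = str.maketrans(accented, unaccented)

folded = "Résultat déjà reçu".translate(table)
print(folded)
```

Characters that are not in the table pass through unchanged, so this can be combined with encode("ascii", "replace") to catch anything the mapping misses.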

Question 10

s.replace(u'Â ', '')              # u before string is important

and save your .py file in a Unicode encoding such as UTF-8.

Question 11

This is a dirty hack, but may work.

s2 = ""
for i in s:
    if ord(i) < 128:
        s2 += i

Question 12

For what it's worth, my character set was UTF-8 and I had included the classic "# -*- coding: utf-8 -*-" line.

However, I discovered that I didn't have Universal Newlines when reading this data from a webpage.

My text had two words separated by "\r\n". I was only splitting on "\n" and replacing "\n", so the "\r" was left behind.

Once I looped through and saw the character set in question, I realized the mistake.

So, it could also be within the ASCII character set, but a character that you didn’t expect.
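
A small sketch of how such a stray character can be spotted and avoided, using a made-up two-word sample:

```python
# Assumed sample: two words separated by a Windows-style line ending.
text = "first\r\nsecond"

naive = text.split("\n")     # the "\r" stays attached to the first word
print(repr(naive[0]))        # repr() makes the invisible character visible

lines = text.splitlines()    # handles "\n", "\r\n" and "\r" alike
print(lines)
```

Printing with repr() while debugging shows control characters as escape sequences instead of rendering them invisibly.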
