Python: Remove \xa0 from string?

Question: Python: Remove \xa0 from string?

I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I’m being left with a lot of \xa0 Unicode representing spaces. Is there an efficient way to remove all of them in Python 2.7, and change them into spaces? I guess the more generalized question would be, is there a way to remove Unicode formatting?

I tried using: line = line.replace(u'\xa0',' '), as suggested by another thread, but that changed the \xa0’s to u’s, so now I have “u”s everywhere instead. ):

EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?


Answer 0

\xa0 is actually non-breaking space in Latin1 (ISO 8859-1), also chr(160). You should replace it with a space.

string = string.replace(u'\xa0', u' ')

When you call .encode('utf-8'), it encodes the Unicode string as UTF-8, which means each Unicode character can be represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0.
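
A quick illustration of both points, assuming Python 2.7 as in the question (the sample string is made up for the example):

line = u'foo\xa0bar'
print line.replace(u'\xa0', u' ').encode('utf-8')   # prints: foo bar
print repr(u'\xa0'.encode('utf-8'))                 # prints: '\xc2\xa0' -- the two UTF-8 bytes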

Read up on http://docs.python.org/howto/unicode.html.

Please note: this answer is from 2012. Python has moved on since then; you should be able to use unicodedata.normalize now.


Answer 1

There are many useful things in Python's unicodedata library. One of them is the .normalize() function.

Try:

new_str = unicodedata.normalize("NFKD", unicode_str)

Replace NFKD with any of the other normalization forms listed in the unicodedata documentation if you don't get the results you're after.
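
A minimal runnable sketch of that approach, assuming Python 2.7 and a unicode string such as the one returned by get_text():

import unicodedata

text = u'Dear Parent,\xa0Thanks'
# NFKD maps the non-breaking space (U+00A0) to a plain space via its compatibility decomposition
print unicodedata.normalize("NFKD", text)   # prints: Dear Parent, Thanks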


Answer 2

Try using .strip() at the end of your line: line.strip() worked well for me.
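
For instance, a small sketch assuming Python 2.7 and a unicode string (note that strip() only removes leading and trailing whitespace, not \xa0 in the middle of the text):

line = u'\xa0Dear Parent\xa0'
print line.strip()   # prints: Dear Parent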


Answer 3

After trying several methods, to summarize, this is how I did it. The following are two ways of avoiding/removing \xa0 characters from a parsed HTML string.

Assume we have our raw HTML as follows:

raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'

So let's try to clean this HTML string:

from bs4 import BeautifulSoup
raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
text_string = BeautifulSoup(raw_html, "lxml").text
print text_string
#u'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'

The above code leaves these \xa0 characters in the string. To remove them properly, we can use two approaches.

Method # 1 (Recommended): The first one is Beautiful Soup's get_text method with the strip argument set to True, so our code becomes:

clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print clean_text
# Dear Parent,This is a test message,kindly ignore it.Thanks
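
A variation worth noting, assuming Beautiful Soup 4 (whose get_text() also takes a separator argument): passing separator=" " keeps a single space between the stripped fragments instead of gluing them together:

clean_text = BeautifulSoup(raw_html, "lxml").get_text(separator=" ", strip=True)
print clean_text
# Dear Parent, This is a test message, kindly ignore it. Thanks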

Method # 2: The other option is to use Python's unicodedata library:

import unicodedata
text_string = BeautifulSoup(raw_html, "lxml").text
clean_text = unicodedata.normalize("NFKD",text_string)
print clean_text
# u'Dear Parent,This is a test message,kindly ignore it.Thanks'

I have also detailed these methods on this blog, which you may want to refer to.


Answer 4

Try this:

string.replace('\\xa0', ' ')  # note: the doubled backslash makes this match the literal four-character text "\xa0", not the actual non-breaking space character

Answer 5

I ran into this same problem pulling some data from a sqlite3 database with Python. The above answers didn't work for me (not sure why), but this did: line = line.decode('ascii', 'ignore'). However, my goal was deleting the \xa0s, rather than replacing them with spaces.

I got this from this super-helpful unicode tutorial by Ned Batchelder.
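
A minimal illustration of that approach, assuming Python 2.7 and a byte string holding the UTF-8-encoded character (as you might get from parsed HTML or a database):

line = 'Dear Parent,\xc2\xa0Thanks'
# 'ignore' silently drops every non-ASCII byte, so the \xa0 disappears instead of becoming a space
print line.decode('ascii', 'ignore')   # prints: Dear Parent,Thanks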


Answer 6

I ended up here while googling for the problem with a non-printable character. I use MySQL UTF-8 general_ci and deal with the Polish language. For problematic strings I have to proceed as follows:

text=text.replace('\xc2\xa0', ' ')

It is just a quick workaround, and you should probably try something with the right encoding setup.


Answer 7

Try this code:

import re
# drop every run of non-ASCII bytes entirely (the characters are removed, not replaced with spaces)
re.sub(r'[^\x00-\x7F]+', '', 'paste your string here').decode('utf-8', 'ignore').strip()

Answer 8

0xA0 (Unicode) is 0xC2 0xA0 in UTF-8. .encode('utf8') will just take your Unicode 0xA0 and replace it with UTF-8's 0xC2 0xA0, hence the appearance of the 0xC2s. Encoding is not replacing, as you've probably realized by now.
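
A short check of this, assuming Python 2.7 as in the question:

print repr(u'\xa0'.encode('utf8'))           # prints: '\xc2\xa0' -- the 0xC2 byte comes from the UTF-8 encoding
print '\xc2\xa0'.decode('utf8') == u'\xa0'   # prints: True -- decoding gives the same character back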


Answer 9

It’s the equivalent of a space character, so strip it

print(string.strip()) # no more xa0

Answer 10

In Beautiful Soup, you can pass get_text() the strip parameter, which strips whitespace from the beginning and end of the text. This will remove \xa0 or any other whitespace if it occurs at the start or end of the string. Beautiful Soup replaced the \xa0 with an empty string, and this solved the problem for me.

mytext = soup.get_text(strip=True)

Answer 11

Generic version with a regular expression (note that the pattern matches literal \x.. escape sequences appearing in the text, e.g. in repr()-style output, rather than the characters themselves):

import re
def remove_control_chart(s):
    # remove every literal backslash-x escape sequence such as "\xa0"
    return re.sub(r'\\x..', '', s)

Answer 12

Python recognizes it as a space character, so you can split the string without arguments and join with a normal space:

line = ' '.join(line.split())
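
For example, assuming Python 2.7 and a unicode string such as the one returned by get_text() (split() with no arguments treats \xa0 as whitespace on unicode strings):

line = u'Dear Parent,\xa0kindly ignore it.'
print ' '.join(line.split())   # prints: Dear Parent, kindly ignore it.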