Tag Archives: beautifulsoup

How to find elements by class

Question: How to find elements by class


I’m having trouble parsing HTML elements with a “class” attribute using BeautifulSoup. The code looks like this:

soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs: 
    if (div["class"] == "stylelistrow"):
        print div

I get an error on the same line “after” the script finishes.

File "./beautifulcoding.py", line 130, in getlanguage
  if (div["class"] == "stylelistrow"):
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 599, in __getitem__
   return self._getAttrMap()[key]
KeyError: 'class'

How do I get rid of this error?


Answer 0


You can refine your search to only find those divs with a given class using BS3:

mydivs = soup.findAll("div", {"class": "stylelistrow"})
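
For context, here is a minimal, self-contained sketch of how that call replaces the original loop-and-check (the sdata markup is invented for illustration; the BS3 import and Python 2 print match the question’s environment):

from BeautifulSoup import BeautifulSoup  # BS3; for bs4 use: from bs4 import BeautifulSoup

sdata = '<div class="stylelistrow">one</div><div class="other">two</div>'
soup = BeautifulSoup(sdata)

# Only divs whose class is "stylelistrow" come back, so no KeyError check is needed.
mydivs = soup.findAll("div", {"class": "stylelistrow"})
for div in mydivs:
    print div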

Answer 1


From the documentation:

As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:

soup.find_all("a", class_="sister")

Which in this case would be:

soup.find_all("div", class_="stylelistrow")

It would also work for:

soup.find_all("div", class_="stylelistrowone stylelistrowtwo")

Answer 2


Update (2016): in the latest version of BeautifulSoup, the method ‘findAll’ has been renamed to ‘find_all’. Link to official documentation.

Hence the answer will be

soup.find_all("html_element", class_="your_class_name")

Answer 3


Specific to BeautifulSoup 3:

soup.findAll('div',
             {'class': lambda x: x 
                       and 'stylelistrow' in x.split()
             }
            )

Will find all of these:

<div class="stylelistrow">
<div class="stylelistrow button">
<div class="button stylelistrow">

Answer 4


A straightforward way would be:

soup = BeautifulSoup(sdata)
for each_div in soup.findAll('div',{'class':'stylelist'}):
    print each_div

Make sure you get the casing of findAll right; it’s not findall.


Answer 5


How to find elements by class

I’m having trouble parsing html elements with “class” attribute using Beautifulsoup.

You can easily find by one class, but if you want to find by the intersection of two classes, it’s a little more difficult.

From the documentation (emphasis added):

If you want to search for tags that match two or more CSS classes, you should use a CSS selector:

css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]

To be clear, this selects only the p tags that are both strikeout and body class.

To match any of a set of classes (the union, not the intersection), you can give a list to the class_ keyword argument (as of 4.1.2):

soup = BeautifulSoup(sdata)
class_list = ["stylelistrow"] # can add any other classes to this list.
# will find any divs with any names in class_list:
mydivs = soup.find_all('div', class_=class_list) 

Also note that findAll has been renamed from the camelCase to the more Pythonic find_all.
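
A compact sketch contrasting the two, reusing the css_soup example from the documentation plus one extra tag (the extra markup is illustrative):

from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p><p class="body"></p>', "html.parser")

# Intersection: only the <p> that carries BOTH classes.
print(css_soup.select("p.strikeout.body"))

# Union: any <p> carrying ANY class in the list.
print(css_soup.find_all("p", class_=["body", "strikeout"]))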


Answer 6


CSS selectors

single class first match

soup.select_one('.stylelistrow')

list of matches

soup.select('.stylelistrow')

compound class (i.e. AND another class)

soup.select_one('.stylelistrow.otherclassname')
soup.select('.stylelistrow.otherclassname')

Spaces in compound class names, e.g. class="stylelistrow otherclassname", are replaced with “.”. You can continue to add classes.

list of classes (OR – match whichever is present)

soup.select_one('.stylelistrow, .otherclassname')
soup.select('.stylelistrow, .otherclassname')

bs4 4.7.1 +

Specific class whose innerText contains a string

soup.select_one('.stylelistrow:contains("some string")')
soup.select('.stylelistrow:contains("some string")')

Specific class which has a certain child element e.g. a tag

soup.select_one('.stylelistrow:has(a)')
soup.select('.stylelistrow:has(a)')
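
A small runnable sketch of a few of these selectors, with invented markup; the :has() selector assumes bs4 4.7.1+ (soupsieve), as noted above:

from bs4 import BeautifulSoup

html = '<div class="stylelistrow otherclassname">text</div><div class="stylelistrow"><a href="#">link</a></div>'
soup = BeautifulSoup(html, "html.parser")

print(soup.select_one('.stylelistrow.otherclassname'))  # compound class (AND)
print(soup.select('.stylelistrow, .otherclassname'))    # list of classes (OR)
print(soup.select('.stylelistrow:has(a)'))               # class that has an <a> child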

Answer 7


As of BeautifulSoup 4+,

If you have a single class name, you can just pass the class name as a parameter:

mydivs = soup.find_all('div', 'class_name')

Or if you have more than one class name, just pass the list of class names as a parameter:

mydivs = soup.find_all('div', ['class1', 'class2'])

Answer 8


Try to check if the div has a class attribute first, like this:

soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs:
    if "class" in div:
        if (div["class"]=="stylelistrow"):
            print div
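
On bs4 the same idea can be sketched with Tag.get, which returns None instead of raising KeyError; note that in bs4 the class value is a list, so membership is tested rather than equality (this is an illustrative sketch, not the original answer’s code):

from bs4 import BeautifulSoup

soup = BeautifulSoup(sdata)  # sdata as in the question
for div in soup.find_all('div'):
    if "stylelistrow" in (div.get("class") or []):
        print(div)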

Answer 9


This works for me to access the class attribute (on BeautifulSoup 4, contrary to what the documentation says). The KeyError comes from a list being returned, not a dictionary.

for hit in soup.findAll(name='span'):
    print hit.contents[1]['class']

Answer 10


The following worked for me:

a_tag = soup.find_all("div",class_='full tabpublist')

Answer 11


This worked for me:

for div in mydivs:
    try:
        clazz = div["class"]
    except KeyError:
        clazz = ""
    if (clazz == "stylelistrow"):
        print div

Answer 12


Alternatively, we can use lxml, which supports XPath and is very fast!

from lxml import html, etree 

attr = html.fromstring(html_text)  # parse the raw HTML
handles = attr.xpath('//div[@class="stylelistrow"]')  # XPath expression to find that specific class

for each in handles:
    print(etree.tostring(each))  # print the HTML as a string
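
One caveat, sketched below with invented markup: the @class="stylelistrow" test matches the exact attribute value only, so a token-safe XPath is often preferred when elements carry several classes:

from lxml import html

doc = html.fromstring('<div class="stylelistrow button">x</div><div class="stylelistrow">y</div>')

# Matches both divs, whereas //div[@class="stylelistrow"] would miss the first one.
expr = '//div[contains(concat(" ", normalize-space(@class), " "), " stylelistrow ")]'
for el in doc.xpath(expr):
    print(html.tostring(el))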

Answer 13


This should work:

soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs: 
    if div.get("class") == "stylelistrow":
        print div

Answer 14


Other answers did not work for me.

In other answers the findAll is being used on the soup object itself, but I needed a way to do a find by class name on objects inside a specific element extracted from the object I obtained after doing findAll.

If you are trying to do a search inside nested HTML elements to get objects by class name, try the approach below:

# parse html
page_soup = soup(web_page.read(), "html.parser")

# filter out items matching class name
all_songs = page_soup.findAll("li", "song_item")

# traverse through all_songs
for song in all_songs:

    # get text out of span element matching class 'song_name'
    # doing a 'find' by class name within a specific song element taken out of 'all_songs' collection
    song.find("span", "song_name").text

Points to note:

  1. I’m not explicitly defining the search to be on the ‘class’ attribute, as in findAll("li", {"class": "song_item"}), since class is the only attribute I’m searching on, and findAll searches the class attribute by default if you don’t explicitly say which attribute you want to match on (see the sketch after this list).

  2. When you do a findAll, the resulting object is of class bs4.element.ResultSet, which is a subclass of list (find returns a single Tag). You can keep calling find or findAll on the elements inside a ResultSet, at any level of nesting.

  3. My BS4 version – 4.9.1, Python version – 3.8.1
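
To make point 1 concrete, here is a tiny sketch (markup invented) showing that the positional string and the explicit attrs dict are equivalent:

from bs4 import BeautifulSoup

page = BeautifulSoup('<li class="song_item"><span class="song_name">Song A</span></li>', "html.parser")

# A plain string as the second argument is treated as a CSS class filter,
# so these two calls return the same result.
print(page.findAll("li", "song_item"))
print(page.findAll("li", {"class": "song_item"}))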


Answer 15


The following should work:

soup.find('span', attrs={'class':'totalcount'})

Replace ‘totalcount’ with your class name and ‘span’ with the tag you are looking for. Also, if your class contains multiple space-separated names, just choose one and use it.

P.S. This finds the first element matching the given criteria. If you want to find all matching elements, replace ‘find’ with ‘find_all’.


UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

Question: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)


I’m having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.

The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes, it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.

One of the sections of code that is causing problems is shown below:

agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()

Here is a stack trace produced on SOME strings when the snippet above is run:

Traceback (most recent call last):
  File "foobar.py", line 792, in <module>
    p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption – so there are no issues relating to internationalization or dealing with text written in anything other than English.

Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?


Answer 0


You need to read the Python Unicode HOWTO. This error is the very first example.

Basically, stop using str to convert from unicode to encoded text / bytes.

Instead, properly use .encode() to encode the string:

p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()

or work entirely in unicode.
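
As a minimal sketch of the difference (the variable values are stand-ins containing the same U+00A0 character as in the traceback; Python 2 is assumed, as in the question):

# -*- coding: utf-8 -*-
agent_contact = u'John'
agent_telno = u'\xa0 0123456789'

# str(agent_contact + ' ' + agent_telno) would implicitly encode with the
# ASCII codec and raise UnicodeEncodeError on u'\xa0'.
agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()
print agent_info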


Answer 1


This is a classic python unicode pain point! Consider the following:

a = u'bats\u00E0'
print a
 => batsà

All good so far, but if we call str(a), let’s see what happens:

str(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

Oh dip, that’s not gonna do anyone any good! To fix the error, encode the bytes explicitly with .encode and tell python what codec to use:

a.encode('utf-8')
 => 'bats\xc3\xa0'
print a.encode('utf-8')
 => batsà

Voil\u00E0!

The issue is that when you call str(), python uses the default character encoding to try and encode the bytes you gave it, which in your case are sometimes representations of unicode characters. To fix the problem, you have to tell python how to deal with the string you give it by using .encode(‘whatever_unicode’). Most of the time, you should be fine using utf-8.

For an excellent exposition on this topic, see Ned Batchelder’s PyCon talk here: http://nedbatchelder.com/text/unipain.html


Answer 2


I found an elegant workaround to remove the symbols and keep the string as a string, as follows:

yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')

It’s important to notice that using the ignore option is dangerous because it silently drops any Unicode (and internationalization) support from the code that uses it, as seen here (convert unicode):

>>> u'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
'City: Malm'

Answer 3


Well, I tried everything but it did not help. After googling around, I figured out the following and it helped. Python 2.7 is in use.

# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')

Answer 4


A subtle problem causing even print to fail is having your environment variables set wrong, e.g. here LC_ALL is set to “C”. In Debian they discourage setting it: Debian wiki on Locale.

$ echo $LANG
en_US.utf8
$ echo $LC_ALL 
C
$ python -c "print (u'voil\u00e0')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
$ export LC_ALL='en_US.utf8'
$ python -c "print (u'voil\u00e0')"
voilà
$ unset LC_ALL
$ python -c "print (u'voil\u00e0')"
voilà

Answer 5


For me, what worked was:

BeautifulSoup(html_text,from_encoding="utf-8")

Hope this helps someone.
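
A brief sketch of where from_encoding fits: it applies when you hand BeautifulSoup raw bytes and want to override its encoding detection (the file name here is hypothetical):

from bs4 import BeautifulSoup

with open('page.html', 'rb') as f:  # hypothetical file, read as raw bytes
    raw_bytes = f.read()

soup = BeautifulSoup(raw_bytes, from_encoding="utf-8")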


Answer 6


I’ve actually found that in most of my cases, just stripping out those characters is much simpler:

s = mystring.decode('ascii', 'ignore')

Answer 7


The problem is that you’re trying to print a unicode character, but your terminal doesn’t support it.

You can try installing language-pack-en package to fix that:

sudo apt-get install language-pack-en

which provides English translation data updates for all supported packages (including Python). Install a different language pack if necessary (depending on which characters you’re trying to print).

On some Linux distributions it’s required in order to make sure that the default English locales are set up properly (so Unicode characters can be handled by the shell/terminal). Sometimes it’s easier to install it than to configure it manually.

Then when writing the code, make sure you use the right encoding in your code.

For example:

open(foo, encoding='utf-8')

If you’ve still a problem, double check your system configuration, such as:

  • Your locale file (/etc/default/locale), which should have e.g.

    LANG="en_US.UTF-8"
    LC_ALL="en_US.UTF-8"
    

    or:

    LC_ALL=C.UTF-8
    LANG=C.UTF-8
    
  • Value of LANG/LC_CTYPE in shell.

  • Check which locale your shell supports by:

    locale -a | grep "UTF-8"
    

Demonstrating the problem and solution in fresh VM.

  1. Initialize and provision the VM (e.g. using vagrant):

    vagrant init ubuntu/trusty64; vagrant up; vagrant ssh
    

    See: available Ubuntu boxes..

  2. Printing Unicode characters (such as the trade mark sign ™):

    $ python -c 'print(u"\u2122");'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
    
  3. Now installing language-pack-en:

    $ sudo apt-get -y install language-pack-en
    The following extra packages will be installed:
      language-pack-en-base
    Generating locales...
      en_GB.UTF-8... /usr/sbin/locale-gen: done
    Generation complete.
    
  4. Now problem should be solved:

    $ python -c 'print(u"\u2122");'
    ™
    
  5. Otherwise, try the following command:

    $ LC_ALL=C.UTF-8 python -c 'print(u"\u2122");'
    ™
    

Answer 8


In shell:

  1. Find supported UTF-8 locale by the following command:

    locale -a | grep "UTF-8"
    
  2. Export it, before running the script, e.g.:

    export LC_ALL=$(locale -a | grep UTF-8)
    

    or manually like:

    export LC_ALL=C.UTF-8
    
  3. Test it by printing a special character, e.g. ™:

    python -c 'print(u"\u2122");'
    

The above was tested in Ubuntu.
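
A quick way to check from inside Python what encoding the interpreter actually picked up after these exports (a small sketch; works on Python 2 and 3):

import sys
import locale

# If these report 'ascii' / 'ANSI_X3.4-1968' instead of UTF-8, printing
# non-ASCII text will fail exactly as in the traceback above.
print(sys.stdout.encoding)
print(locale.getpreferredencoding())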


Answer 9


Add the line below at the beginning of your script (or as the second line):

# -*- coding: utf-8 -*-

That’s the definition of the Python source-code encoding. More info is in PEP 263.


Answer 10


Here’s a rehashing of some other so-called “cop out” answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.

def safeStr(obj):
    try: return str(obj)
    except UnicodeEncodeError:
        return obj.encode('ascii', 'ignore').decode('ascii')
    except: return ""

Testing it:

if __name__ == '__main__': 
    print safeStr( 1 ) 
    print safeStr( "test" ) 
    print u'98\xb0'
    print safeStr( u'98\xb0' )

Results:

1
test
98°
98

Suggestion: you might want to name this function toAscii instead? That’s a matter of preference.

This was written for Python 2. For Python 3, I believe you’ll want to use bytes(obj,"ascii") rather than str(obj). I didn’t test this yet, but I will at some point and revise the answer.


Answer 11


I always put the code below in the first two lines of the python files:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

Answer 12


Simple helper functions found here.

def safe_unicode(obj, *args):
    """ return the unicode representation of obj """
    try:
        return unicode(obj, *args)
    except UnicodeDecodeError:
        # obj is byte string
        ascii_text = str(obj).encode('string_escape')
        return unicode(ascii_text)

def safe_str(obj):
    """ return the byte string representation of obj """
    try:
        return str(obj)
    except UnicodeEncodeError:
        # obj is unicode
        return unicode(obj).encode('unicode_escape')
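
A hypothetical usage sketch of these helpers, assuming Python 2 (the sample strings are illustrative):

# safe_str falls back to an escaped byte string when plain str() would raise.
print safe_str(u'City: Malm\xf6')         # prints: City: Malm\xf6
print safe_unicode('City: Malm\xc3\xb6')  # prints: City: Malm\xc3\xb6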

Answer 13


Just call encode(‘utf-8’) on the variable:

agent_contact.encode('utf-8')

Answer 14


Please open a terminal and run the command below:

export LC_ALL="en_US.UTF-8"

Answer 15


I just used the following:

import unicodedata
message = unicodedata.normalize("NFKD", message)

Check what the documentation says about it:

unicodedata.normalize(form, unistr) Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.

The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various way. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and normal form D. Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form. Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.

In addition to these two forms, there are two additional normal forms based on compatibility equivalence. In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g. gb2312).

The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents. The normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.

Even if two unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn’t, they may not compare equal.

Solves it for me. Simple and easy.


Answer 16


The solution below worked for me. I just added

u"String"

(a u prefix, representing the string as unicode) before my string.

result_html = result.to_html(col_space=1, index=False, justify={'right'})

text = u"""
<html>
<body>
<p>
Hello all, <br>
<br>
Here's weekly summary report.  Let me know if you have any questions. <br>
<br>
Data Summary <br>
<br>
<br>
{0}
</p>
<p>Thanks,</p>
<p>Data Team</p>
</body></html>
""".format(result_html)

Answer 17


Alas this works in Python 3 at least…

Python 3

Sometimes the error is in the environment variables and encoding, so

import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"
myLocale=locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")
... 
print(myText.encode('utf-8', errors='ignore'))

where errors are ignored in encoding.


Answer 18


I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:

# 'value' contains the problematic data
unic = u''
unic += value
value = unic

I had this idea after reading Ned’s presentation.

I don’t claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I’ll appreciate it.


Answer 19


We struck this error when running manage.py migrate in Django with localized fixtures.

Our source contained the # -*- coding: utf-8 -*- declaration, MySQL was correctly configured for utf8 and Ubuntu had the appropriate language pack and values in /etc/default/locale.

The issue was simply that the Django container (we use docker) was missing the LANG env var.

Setting LANG to en_US.UTF-8 and restarting the container before re-running migrations fixed the problem.


Answer 20


Many answers here (@agf and @Andbdrew for example) have already addressed the most immediate aspects of the OP question.

However, I think there is one subtle but important aspect that has been largely ignored and that matters dearly for everyone who like me ended up here while trying to make sense of encodings in Python: Python 2 vs Python 3 management of character representation is wildly different. I feel like a big chunk of confusion out there has to do with people reading about encodings in Python without being version aware.

I suggest anyone interested in understanding the root cause of OP problem to begin by reading Spolsky’s introduction to character representations and Unicode and then move to Batchelder on Unicode in Python 2 and Python 3.


Answer 21


Try to avoid converting a variable with str(variable). Sometimes it may cause the issue.

A simple tip to avoid it:

try: 
    data=str(data)
except:
    data = data #Don't convert to String

The above example will also avoid the encode error.


Answer 22


If you have something like packet_data = "This is data" then do this on the next line, right after initializing packet_data:

unic = u''
packet_data = unic + packet_data

Answer 23


Update for Python 3.0 and later. Try the following in a terminal:

locale-gen en_US.UTF-8
export LANG=en_US.UTF-8 LANGUAGE=en_US.en
LC_ALL=en_US.UTF-8

This sets the system’s default locale encoding to UTF-8.

More can be read here at PEP 538 — Coercing the legacy C locale to a UTF-8 based locale.


Answer 24


I had this issue trying to output Unicode characters to stdout, but with sys.stdout.write, rather than print (so that I could support output to a different file as well).

From BeautifulSoup’s own documentation, I solved this with the codecs library:

import sys
import codecs

def main(fIn, fOut):
    soup = BeautifulSoup(fIn)
    # Do processing, with data including non-ASCII characters
    fOut.write(unicode(soup))

if __name__ == '__main__':
    with (sys.stdin) as fIn: # Don't think we need codecs.getreader here
        with codecs.getwriter('utf-8')(sys.stdout) as fOut:
            main(fIn, fOut)

Answer 25


This problem often happens when a Django project is deployed using Apache, because Apache sets the environment variable LANG=C in /etc/sysconfig/httpd. Just open the file and comment out (or change to your flavor) this setting. Or use the lang option of the WSGIDaemonProcess command, in which case you will be able to set a different LANG environment variable for different virtual hosts.


Answer 26


The recommended solution did not work for me, and I could live with dumping all non-ASCII characters, so

s = s.encode('ascii',errors='ignore')

which left me with something stripped that doesn’t throw errors.


Answer 27


This will work:

 >>> import re, unicodedata
 >>> print(unicodedata.normalize('NFD', re.sub("[\(\[].*?[\)\]]", "", "bats\xc3\xa0")).encode('ascii', 'ignore'))

Output:

bats