Question: Decode HTML entities in a Python string?

I’m parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn’t automatically decode for me:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string

>>> print text
&pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?


Answer 0

Python 3.4+

Use html.unescape():

import html
print(html.unescape('&pound;682m'))

FYI html.parser.HTMLParser.unescape is deprecated; it was supposed to be removed in 3.5 but was left in by mistake, and it was finally removed in Python 3.9.
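For reference, html.unescape() resolves named, decimal, and hexadecimal character references alike; a quick sketch:

import html

# Named, decimal, and hexadecimal references all decode to the same character
print(html.unescape("&pound;682m"))   # £682m
print(html.unescape("&#163;682m"))    # £682m
print(html.unescape("&#xa3;682m"))    # £682m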


Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

>>> try:
...     # Python 2.6-2.7 
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
... 
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

Answer 1

Beautiful Soup handles entity conversion. In Beautiful Soup 3, you’ll need to specify the convertEntities argument to the BeautifulSoup constructor (see the ‘Entity Conversion’ section of the archived docs). In Beautiful Soup 4, entities get decoded automatically.

Beautiful Soup 3

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>", 
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>

Beautiful Soup 4

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>")
<html><body><p>£682m</p></body></html>
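To get just the decoded text the question asks for, pulling the string out of the Beautiful Soup 4 tree is enough, since the entities are already resolved at parse time. A minimal sketch (passing "html.parser" explicitly just avoids the no-parser warning):

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>&pound;682m</p>", "html.parser")
>>> soup.find("p").string
'£682m'
>>> soup.get_text()
'£682m'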

Answer 2

You can use replace_entities from the w3lib.html library:

In [202]: from w3lib.html import replace_entities

In [203]: replace_entities("&pound;682m")
Out[203]: u'\xa3682m'

In [204]: print replace_entities("&pound;682m")
£682m
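If I remember the w3lib behaviour correctly, numeric character references are resolved as well; a minimal sketch in Python 3, where the result is a plain str:

from w3lib.html import replace_entities

# Named and numeric references are both resolved
print(replace_entities("&pound;682m"))          # £682m
print(replace_entities("&#163;682m &amp; up"))  # £682m & up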

Answer 3

Beautiful Soup 4 allows you to set a formatter for its output.

If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples:

print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>

link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>
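For the question's snippet the formatter only matters when the tree is serialized again; the parsed text itself is already decoded. A minimal sketch of how I understand the formatters to behave:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>&pound;682m</p>", "html.parser")

# The text in the tree is already decoded; formatters only control how
# characters are re-escaped when the tree is turned back into markup.
print(soup.p.string)                    # £682m
print(soup.p.encode(formatter=None))    # b'<p>\xc2\xa3682m</p>'
print(soup.p.encode(formatter="html"))  # b'<p>&pound;682m</p>'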

Answer 4

I had a similar encoding issue. I used the normalize() method. I was getting a Unicode error using the pandas .to_html() method when exporting my data frame to an .html file in another directory. I ended up doing this and it worked…

    import unicodedata 

The dataframe object can be whatever you like, let’s call it table…

    table = pd.DataFrame(data,columns=['Name','Team','OVR / POT'])
    table.index+= 1

Encode the table data so that we can export it to our .html file in the templates folder (this can be whatever location you wish :))

    # this is where the magic happens
    html_data = unicodedata.normalize('NFKD', table.to_html()).encode('ascii', 'ignore')

Export the normalized data to the HTML file (html_data is bytes after .encode(), so the file is opened in binary mode):

    file = open("templates/home.html", "wb")
    file.write(html_data)
    file.close()
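As a side note, here is what the normalize-then-encode step actually does to a small sample string (a minimal sketch; characters with no ASCII equivalent, such as £, are dropped by the 'ignore' handler, while accents are stripped):

    import unicodedata

    sample = u"\xa3682m \u2013 caf\xe9"  # "£682m – café"
    ascii_bytes = unicodedata.normalize("NFKD", sample).encode("ascii", "ignore")
    print(ascii_bytes)  # 682m  cafe (the pound sign and the dash are dropped, the accent is stripped)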

Reference: unicodedata documentation


Answer 5

This probably isn't relevant here. But to eliminate these HTML entities from an entire document, you can do something like this (assume document = page, and please forgive the sloppy code; if you have ideas on how to make it better, I'm all ears, I'm new to this):

import re
import HTMLParser

h = HTMLParser.HTMLParser()
regexp = "&.+?;"
list_of_html = re.findall(regexp, page)   # finds all HTML entities in page
for e in list_of_html:
    unescaped = h.unescape(e)             # the decoded value of the entity
    page = page.replace(e, unescaped)     # replace the entity with its decoded value
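On Python 3 the same idea collapses into a single re.sub() pass (a sketch using html.unescape, which could of course also just be applied to the whole page directly):

import re
import html

def decode_entities(page):
    # Replace each entity in one pass instead of calling str.replace() repeatedly
    return re.sub(r"&[#\w]+?;", lambda m: html.unescape(m.group(0)), page)

print(decode_entities("<p>&pound;682m &amp; more</p>"))
# <p>£682m & more</p>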
